ADVANCED TELEPROCESSING SYSTEMS


Defense  Advanced  Research  Projects  Agency 
Semi-Annual Technical Report


March 31, 1985
September  30, 1985 


INTRODUCTION 

This Semi-Annual Technical Report covers research carried out by the Advanced Teleprocessing
Systems Group at UCLA under DARPA Contract No. MDA 903-82-C-0064 covering the period
from October 1, 1984 to September 30, 1985. Under this contract we have three designated tasks
as follows:


TASK  I.  DISTRIBUTED  COMMUNICATIONS  ACCESS 

The general problem of sharing a multi-access broadcast distributed system
among a set of competing users will be studied. General issues involving
exhaustive communications, start-up problems and refined
models to manifest some more realistic phenomena in these systems will
be studied. Applications to packet radio systems and large survivable
networks involving the study of tandem networks, multi-hop networks,
one-way communication links, correct reception of more than one simultaneous
transmission and mobility will be included. Further applications
will include the study of very high bandwidth channels and/or very long
propagation delay systems, multiple token systems and compound
hierarchical network structures.


TASK  II.  DISTRIBUTED  PROCESSING 

The interplay between distributed communications in a broadcast environment
and processing of distributed data will be studied. For example,
the effect of merging sorted lists in a broadcast environment, as well
as finding properties of elements in these lists, will be studied. Concurrency
in multiprocessor systems will be studied in order to investigate
performance in terms of response time and speedup factors for various
graph models of computation. Connection architectures for multiprocessor
systems will be investigated as well. One application here is the
structure of the processing and communication architecture for supercomputers.


TASK  III.  DISTRIBUTED  CONTROL  AND  ALGORITHMS 

Routing,  flow  control  and  survivability  in  large  packet  radio  networks  as 
well  as  in  public  data  networks  will  be  studied  as  control  algorithms  in  a 
distributed  environment.  Measures  of  performance,  including 
throughput,  response  time,  blocking,  power,  fairness,  and  robustness  will 
be  applied  to  these  systems.  Distributed  algorithms  for  finding  shortest 
paths,  connectivity,  loops,  etc.  will  be  studied.  The  effect  of  node  and 
link  failures,  limited  amounts  of  memory  at  each  node  and  restricted 
channel  capacity  for  communications  will  be  investigated.  The  effect  of 
network failures and delays on distributed data base management systems
will  also  be  studied. 


A  number  of  papers  reporting  our  research  results  were  published  during  this  period  and  they  are 
listed  below  under  "Research  Publications".  The  total  output  consists  of  one  Ph.D.  dissertation, 
four  M.S.  theses,  and  twelve  published  papers. 

These research publications cover a broad spectrum of results in the areas of random access
communications, computer networks, and distributed processing (a new and rapidly growing area of
research and practical importance). In this latter area, the major works involved the behavior of
distributed algorithms for election and traversal in networks, and the evaluation of the achievable
and achieved parallelism in parallel processing systems. A general model of distributed processing
exposed a potential problem in the performance of distributed systems. In the more traditional
areas of networks and multiaccess, our results focus on improved and innovative models.

A major contribution of our research during this reporting period is contained in Reference 9 listed
below, namely, "Distributed Algorithms for Election in Unidirectional and Complete Networks",
by Yehuda Afek. This dissertation was supervised by Professor Leonard Kleinrock (Principal
Investigator for this research contract). This entire dissertation forms the body of the report
herein. The abstract of that work follows:


Consider  a  data  communication  network  of  n  nodes,  each  of  which  has  a 
unique  identifier  (id);  otherwise  the  nodes  are  identical.  The  nodes  are 
asleep  and  have  no  global  information  about  network  topology,  number 
and  ids  of  other  nodes,  etc.  A  distributed  election  algorithm  is  a  means 
by  which  the  nodes  of  the  network  distinguish  one  among  them  as  the 
leader. 

The  problem  of  distributively  electing  a  leader  in  a  network  is  viewed  as 
a  problem  of  synchronization  among  potential  candidates  for  leadership. 
Each  candidate  tries  to  capture  all  the  nodes.  To  guarantee  that  only  one 
succeeds, all but one candidate are killed. Following this view, election
algorithms are designed in a general, two-component framework. Component
one is a capturing and termination detection mechanism, assuming
only one candidate. Component two is a synchronization mechanism
to eliminate all but one candidate.


In arbitrary networks the synchronization is complicated by the uncertainties
of nodes about the network topology and the relative location of
candidates. Two network models are considered: first, a complete network
in which a bidirectional communication link connects every node
with  every  other,  thus  eliminating  topological  uncertainties;  and  second, 
the  opposite  extreme  in  which  topological  uncertainties  are  at  maximum  - 
a  strongly  connected  unidirectional  network  with  some  or  all  links 
transmitting  messages  in  one  direction  only. 

The study produces an $O(n \log n)$ messages, $O(\log n)$ time synchronous and an
$O(n \log n)$ messages, $O(n)$ time asynchronous election algorithm in complete
networks. For unidirectional networks we derive a distributed election
algorithm whose communication complexity is $O(n \cdot |E| + n^2 \log n)$
bits, where $|E|$ is the total number of links.

We also establish that $\Omega(n \log n)$ is a lower bound on the total number of
messages transmitted for achieving election in synchronous complete
networks. Moreover, it is shown that the time complexity of message-optimal
synchronous algorithms is $\Omega(\log n)$, hence the optimality of our
synchronous complete network algorithm. It remains open whether a
sublinear time, message-optimal, asynchronous complete network election
algorithm exists.


The  following  list  of  research  publications  summarizes  the  results  of  this  reporting  period  and  the 
abstract  of  each  paper  is  given  along  with  the  reference  itself.  Following  this  list  is  the  body  of 
Dr.  Afek's  dissertation. 


RESEARCH  PUBLICATIONS 


1. Rosenberg, C. P. "Exponential Queueing Systems with Randomly Changing Arrival
and Service Rates," 1984.

There  exists  a  large  number  of  situations  in  which  the  input  or  service 
variations are not known deterministically. A typical example is a communication
network with a sudden unpredictable increase in the traffic
due  to  an  external  phenomenon  or  an  unpredictable  breakdown  of  a 
server.  Random  intensity  models  are  natural  for  modeling  such 
phenomena. 

We  introduce  two  models  with  randomly  changing  arrival  and  service 
rates,  which  do  not  obey  a  certain  independence  assumption  often  made 
in  Queueing  Theory.  A  complete  analysis  of  the  two  models  is  carried 
out and explicit results are given. Jury's criteria are used to find necessary
and sufficient conditions for stability.


2. Green, J. J. "Analysis of the Time Stamp Queue," 1984.


In this paper we analyze the model of the time stamp queue. The model
approximates the behavior of a distributed simulation protocol, known as
the Link Time Algorithm. The analysis shows that the infinite memory
version of the protocol is inherently unstable (i.e. only one of the queues
will remain finite). The instability property is shown to hold for a processor
waiting on two or more queues. A comparison of output rates
between one, two, and three processor models is also provided. The comparison
shows that dividing up a finite capacity among a number of processors
yields an output rate that is lower than the rate from a single processor
with the same total capacity. This result is analogous to the
resource sharing phenomena in communication systems. Finally, we attempt
to establish conditions for stability in the N queue Time Stamp
model with different decision policies. We show that if a processor waits
on subsets of the N queues, then there may exist a stable system with the
proper choice of parameters, whereas if the processor waits on all N
queues then only one will remain finite, and the rest will be unbounded.


3. Huang, J. H. "Throughput Analysis for Certain Multi-access Protocols with Imperfect
Sensing," 1985.

When using a BTMA (Busy Tone Multi-Access) protocol in a packet radio
network, we cannot ignore the probability of mis-detecting the busy
or idle state of the channel. Similarly with a CSMA (Carrier Sense
Multi-Access) protocol, if the sensor is not perfect, we must consider the
case when the channel state is mis-detected. This thesis focuses on these
problems and tries to evaluate the throughput behavior in p-persistent
access schemes using an iterative approximation method. Lastly, we suggest
a new protocol which can effectively cope with the imperfect sensing
property to achieve an improved throughput.


4. Kleinrock, L. "On the Theory of Distributed Processing," presented at and published
in the Proceedings of the Twenty-Second Annual Allerton Conference, October 1984.

We consider a distributed processing environment in which a total processing
capacity is split into smaller processing units (of the same total
capacity), which collectively process a stream of jobs. We study the performance
ratio of $T$ (the mean response time seen by jobs in this distributed
environment) to $T_0$ (the mean response time seen by a job when it is
processed in a centralized environment by a single processor). The most
general configuration studied is that of a series-parallel topology. In particular,
we consider $m$ parallel chains, the $k$-th of which contains $n_k$ processors
in series, each of capacity $C_k$ operations/second. We assume that
jobs arrive at the $k$-th chain from a Poisson source at rate $\lambda_k$ jobs/second
and that each job requires an exponentially distributed number of operations
from each processor. We find, for the symmetric system
($n_k = n$, $\lambda_k = \lambda/m$), that $T(\rho)/T_0(\rho) = mn$. For the general system (arbitrary
$n_k$) but with equal loading on each series chain, we show that

$$T(\rho)/T_0(\rho) = \sum_{k=1}^{m} n_k .$$

We find the optimal distribution of traffic among the chains; one property
of this solution is that some of the series chains carry zero traffic. When
we optimize the capacity assignment, we find that
$\min_k n_k \le T(\rho)/T_0(\rho) \le$ [...]. When we do the joint optimization, we find
that $T(\rho)/T_0(\rho) = \min_k n_k$. Distributed processing increases the mean
response time! Lastly, we discuss the effect of processor costs on system
performance.

5. Kleinrock, L. "On Queueing Problems in Random-Access Communications," invited
paper for IEEE Transactions on Information Theory, Special Issue, Vol. IT-31, No. 2,
March 1985, pp. 166-175.

The general problem of allocating the capacity of a communication channel
to a population of geographically distributed terminals is considered.
The main focus is on the queueing problems that arise in the analysis of
random access resolution algorithms. The performance measures of interest
are the channel efficiency and the mean response time. The nature
of known solutions for various random access schemes is discussed and a
lower bound for the mean response time is conjectured.


6. Gerla, M., Chan, H. W. and Boisson de Marca, "Fairness in Computer Networks,"
Proceedings, IEEE International Conference on Communications, (Chicago, June 23-26,
1985), Vol. 3, 1985, pp. 1384-1389.

The success of computer networks is based on the efficient sharing of
resources among several users. Network protocols (e.g. routing, flow
control, etc.) have been traditionally designed to enhance the sharing and
optimize overall system performance, while avoiding pitfalls (e.g. congestion,
deadlocks, etc.). A key performance criterion in this optimization is
fairness, i.e., the ability to provide equal satisfaction to all users, vis-a-vis
their demands and the resources available in the network.

In  this  paper  we  present  a  review  of  several  fairness  criteria  proposed  for 
wide-area,  packet-switched  computer  networks.  A  taxonomy  is  proposed 
and  is  used  to  classify  and  compare  the  various  schemes. 


7. Belghith, A. and L. Kleinrock, "Analysis of the Number of Occupied Processors in a
Multi-Processing System," UCLA, Computer Science Department Report No.
850027, August 1985.

We consider a multi-processor system consisting of a set, say P, of identical
processors. A computer job is represented as a set of tasks partially
ordered by some precedence relationships and represented as a directed
acyclic graph called a Process Graph. In such a graph, nodes represent
tasks and edges represent precedence relationships between these tasks.

Many parameters are in play to characterize the terrain of our multi-processing
system. These are: the job arrival process, the process graph
description (number of nodes, number of levels, number of tasks and
their distribution among the levels, and the precedence relationships
among the tasks), the task processing requirement, and the number of
processors P in the system.


In this report, we investigate the distribution of the number of occupied
processors, the average and variance of such number of occupied processors,
and the probability distribution of the interarrival times between
tasks to the system. We find that the average number of occupied processors
in the system is independent of the number of levels in the process
graph, the placement of tasks among levels, the precedence relationships
among the tasks, the distribution of task service time requirement, the
distribution of the job arrival process and the number of available processors
in the multi-processing system. Such average number of occupied
processors in the system is found to be solely a function of the average
number of tasks per job, the average rate of the job arrival process and the
average task service time.


8. Gerla, M. and H. W. Chan, "Window Selection in Flow Controlled Networks,"
Proceedings of the Ninth Data Communications Symposium, Vancouver, B.C., Canada,
September 1985.

The  end-to-end  window  scheme  is  a  popular  mechanism  for  flow  and 
congestion  control  in  packet  switched  networks.  The  window  scheme 
may  be  implemented  in  the  transport  layer  protocol  (like  in  TCP),  or  in 
the  source-to-destination  protocol  (like  in  ARPANET),  or  in  the  network 
layer  protocol  (like  X.25).  In  many  implementations,  the  window  size  is 
chosen  at  connection  set  up  time. 

In  this  paper,  we  provide  guidelines  for  window  selection.  Specifically, 
we  show  that  if  the  network  becomes  overloaded,  that  is,  the  offered  load 
exceeds network capacity, then the selection of user windows has a critical
impact on individual user throughputs. Thus, user windows should be
chosen  judiciously,  so  as  to  satisfy  a  well  defined  "fairness"  criterion. 

We formulate the optimal window assignment as a mathematical programming
problem, and show that the exact solution is computationally
impractical because of the combinatorial nature of the problem and the
complexity of the underlying multiple chain, closed network of queues
model. We then develop a heuristic approach which is computationally
very  efficient  and  provides  nearly  optimal  solutions.  Numerical  results 
are  provided  to  illustrate  and  validate  the  method. 

The rapid growth of computer networks and their applications in resource
sharing, data distribution and exploiting parallelism in complex calculations have
increased the demand for distributed network control algorithms. To enable reliable
and efficient use of networks, their computers have to be coordinated to cooperatively
achieve common global objectives. In an effort to accomplish this, many distributed
algorithms have been developed over the last decade.

The solution to the election problem, i.e., the problem of distributively distinguishing
one computer from all the others, is a basic building block in many distributed
algorithms and systems. For example, it is used to replace a faulty coordinating
center in distributed algorithms, such as a routing center in routing algorithms, a
lock-coordinator in a distributed data-base, or a primary site in a replicated distributed
file system.

In this thesis we study distributed algorithms for election in two models of
data communication networks. First, we study a complete network in which every
node is connected to every other node. This network is, topologically, the simplest
one, thus revealing basic principles of distributed election algorithms when the
topological uncertainties are removed. Second, we study arbitrary-topology,
strongly-connected unidirectional networks in which some or all the links can
transmit messages only in one direction. Unlike the complete network, the unidirectional
network topology is the most general one, since every other network topology
can be modeled as a unidirectional network. The study of these two models provides
insight into the basic elements of the design of distributed algorithms in general, and
election algorithms in particular.

1.1.  Data  Communication  Networks 

A data communication network (network, in short) consists of a set of autonomous
processors (nodes) connected by communication lines (links). Each autonomous
computer has its own memory and is capable of carrying out its own local
computations regardless of the status of any other computer in the network. Each
communication line connects a distinct pair of nodes, thus enabling these two nodes
to exchange messages. Message exchange is the only form of communication
between the nodes of the network.

An example of a data communication network is the ARPANET, whose nodes
are dispersed throughout North America and Europe. Some of today's supercomputers
are examples of a network of micro-computers, all situated in one room
(e.g. the Cosmic Cube [Sei85]).

Computer networks can be used to facilitate: (1) resource sharing; (2) data
distribution; and (3) the exploitation of parallelism in complex calculations. Examples
of such tasks are: controlling a telephone system, connecting branches of a
bank, distributed data base systems and distributed simulation. To perform these
tasks, the nodes of the network are coordinated to achieve cooperation in solving a
common problem. The objective of most distributed algorithms is to control and
coordinate the nodes of the network. These algorithms are then used as building
blocks in the implementation of these distributed systems. Distributed algorithms
efficiently solve problems such as: finding all shortest paths in the network, distinguishing
a unique node from all the others and constructing minimum weight spanning
trees. Coordinating the nodes of a network is the major task of distributed algorithms.

1.2.  Distributed  Algorithms 

A  distributed  algorithm  is  a  means  by  which  the  nodes  of  a  network 
cooperatively  achieve  a  common  objective.  The  algorithm  itself  is  a  collection  of 
identical programs, one copy in each node of the network. To perform the algorithm,
the programs communicate with each other via message exchange. Unlike
centralized  algorithms,  the  execution  of  a  distributed  algorithm  can  be  started  by  any 
subset  of  the  nodes  at  any  time.  Although  started  by  a  few  nodes,  the  algorithm  has  a 
unique  objective,  and  all  the  nodes  participating  in  the  algorithm  are  coordinated  to 
efficiently  achieve  that  objective.  Thus,  in  a  distributed  algorithm,  each  node  of  the 
network  performs  part  of  the  total  computation  required  to  achieve  the  algorithm’s 
objective. 

Before a distributed algorithm starts executing, the nodes of the network are
assumed to have only local information about the network. Since networks are very
large and frequently change, no global knowledge is assumed at any one node. Initially,
no node knows the total number of nodes in the network or the network topology.

A node starts its participation in a distributed algorithm in one of two ways: either
by being spontaneously awakened at an arbitrary time, in which case it is called an
initiator, or by receiving a message of the algorithm. The spontaneously awakened
nodes are awakened by an attached host, a user operator, or some other event
which is external to the network.


Unlike  centralized  algorithms,  distributed  algorithms  exhibit  two  forms  of 
non-determinism.  First,  an  execution  of  the  algorithm  may  be  started  by  any  subset 
of  the  processors  at  different  times.  Second,  once  started,  the  distributed  algorithm 
progresses  in  a  non-deterministic  fashion.  At  any  given  time,  neither  the  location 
nor  the  time  of  arrival  of  the  next  message  is  known. 

There  are  two  simple,  straightforward  approaches  in  the  design  of  distributed 
algorithms:  broadcasting  and  semi-centralized  algorithms.  With  broadcasting,  all 
information  required  to  solve  the  problem  is  broadcast  throughout  the  network. 
Each  node  then  employs  a  centralized  algorithm  to  solve  the  problem.  In  semi- 
centralized  algorithms,  a  particular  node  is  selected  ahead  of  time  to  synchronize  and 
coordinate  the  processors  of  the  network.  This  central  node  collects  all  the  required 
information  about  the  network,  such  as  the  topology,  and  uses  centralized  algorithms 
to  solve  the  problem  locally  and  distribute  the  results  to  all  the  other  nodes. 

Considering that a new broadcast must be initiated for every node or link
failure in order to inform the other nodes of the change, the broadcasting solution is
impractical. The result of this will be a huge flow of messages, which will degrade
the performance of any network, in particular of large networks where a high frequency
of failures is expected.

The semi-centralized approach has three drawbacks. First, the central processor
becomes a critical element of the network. The correct and reliable operation of
the entire network then depends on the reliability, availability and correct functioning
of one node. Second, the central processor serializes the operation of the distributed
algorithm in an environment specifically intended to support parallelism.
Third, the central processor and its neighborhood will be swamped with such messages
as topology updates and service requests. The second and third problems
exacerbate each other since, consequently, the whole network operates at the rate of
the central processor. Henceforth, we will not consider broadcasting or semi-centralized
solutions in this work. Rather, we will consider algorithms in which
every node performs only part of the total computation. Before these algorithms
start, no node is distinguished to play any special role in the algorithm. The amount
of communication in the algorithms that we consider is considerably less than that in
the broadcasting solution.

1.3.  Election  and  Traversal 

This  dissertation  addresses  two  problems  in  computer  networking:  (1)  the 
problem  of  distributively  electing  a  leader,  and  (2)  the  traversal  problem.  We 
present  distributed  algorithms  for  solving  these  problems  in  complete  networks  (in 
which  every  node  is  connected  to  every  other  node)  and  unidirectional  networks  (in 
which  some  or  all  the  links  can  transmit  messages  only  in  one  direction). 

1.3.1.  The  Election  Problem 

In the election problem, a single node, called the leader, is to be selected
from a set of nodes which differ only by their identifiers (ids). Initially, no node is
aware of all the other ids. In the distributed election algorithm an arbitrary subset of
nodes wakes up spontaneously at arbitrary times and starts the algorithm by sending
messages over the network. When the message exchange terminates, a leader is distinguished
from all other nodes.
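
The problem statement can be made concrete with a deliberately naive baseline for a complete network. This is a sketch only and is not one of the algorithms of this dissertation: every node that wakes up (spontaneously or on its first message) announces its id on all of its links, and the node with the largest id is the leader. The function name `naive_election` and the single FIFO `in_transit` queue that stands in for arbitrary message delays are assumptions made for illustration. The baseline always uses n(n-1) messages, which is the kind of cost the more careful algorithms are designed to avoid.

```python
from collections import deque


def naive_election(ids, initiators):
    """Elect the largest id in a complete network by all-to-all announcement.

    ids: unique node identifiers; initiators: non-empty subset that wakes up
    spontaneously. Returns (leader_id, number_of_messages_sent).
    """
    ids = list(ids)
    initiators = set(initiators)
    if not initiators or not initiators <= set(ids):
        raise ValueError("initiators must be a non-empty subset of ids")

    awake = set()
    heard = {v: {v} for v in ids}   # ids known to each node, starting with its own
    in_transit = deque()            # (sender, receiver) pairs awaiting delivery
    messages = 0

    def wake(v):
        # On waking, a node announces its id over each of its n-1 links.
        nonlocal messages
        if v in awake:
            return
        awake.add(v)
        for u in ids:
            if u != v:
                in_transit.append((v, u))
                messages += 1

    for v in initiators:
        wake(v)
    while in_transit:
        sender, receiver = in_transit.popleft()
        wake(receiver)               # a sleeping node is awakened by a message
        heard[receiver].add(sender)

    # Each node has now heard on all n-1 ports and picks the largest id.
    leaders = {max(heard[v]) for v in ids}
    assert len(leaders) == 1
    return leaders.pop(), messages
```

For four nodes with ids 4, 9, 2 and 7 and a single initiator, `naive_election([4, 9, 2, 7], [2])` elects node 9 after 12 messages.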

1.3.2.  The  Traversal  Problem 

In the traversal problem, one node, called the root, initiates a single process
(which can be viewed as a token) which must visit all the nodes in the network, one
at a time. If necessary, the process may traverse any link or visit any node several
times.


Since  every  node  is  assumed  to  have  knowledge  only  of  its  own  incident 
links,  a  traversal  algorithm  has  (1)  to  detect  the  time  when  it  has  seen  all  the  nodes, 
and  (2)  to  efficiently  reach  the  nodes  not  yet  visited.  In  order  to  do  this,  the  traversal 
process  marks  nodes  visited  and  carries  along  some  information. 
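
A minimal sketch of such a traversal, assuming bidirectional links and a depth-first token, is given here. It is a standard technique used for illustration and is not claimed to be the traversal algorithm of this dissertation. The names `token_traversal`, `adjacency` and `hops` are hypothetical, and the sketch makes the simplifying assumption that a node can tell which neighbors are already in the token's visited set; in the model where neighbor ids are unknown, the token would instead have to probe a link and be bounced back.

```python
def token_traversal(adjacency, root):
    """Depth-first token traversal of a connected bidirectional network.

    adjacency: dict mapping each node to a list of its neighbors.
    Returns (visit_order, hops), where hops counts link traversals.
    """
    visited = {root}       # information carried along by the token
    path = [root]          # chain of nodes back to the root, used to backtrack
    visit_order = [root]
    hops = 0
    while path:
        here = path[-1]
        # Pick any neighbor not yet visited, if one exists.
        nxt = next((u for u in adjacency[here] if u not in visited), None)
        if nxt is None:
            path.pop()
            if path:
                hops += 1  # token returns to the node it came from
        else:
            visited.add(nxt)
            visit_order.append(nxt)
            path.append(nxt)
            hops += 1      # token moves forward to a new node
    # The token is back at the root with nothing left to explore,
    # which is how termination is detected in this sketch.
    return visit_order, hops
```

On a connected network of n nodes this sketch uses 2(n-1) link traversals, since every tree link is crossed once forward and once on the way back.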

Distributed algorithms for traversal and election are strongly related to each
other. On the one hand, four of the six election algorithms presented in this dissertation
use some sort of traversal algorithm as a building block. On the other hand, if
initiated only by one node, these election algorithms are turned into a traversal algorithm.
The other two algorithms, when initiated only by one node, are turned into a
traversal in which a few nodes are visited simultaneously by different tokens.

A modular technique to design efficient election algorithms on a network for
which a traversal algorithm is given was presented by Korach et al. [Kor85]. Applying
their algorithm, which was developed independently of this study, to a complete
network yields algorithm B of Chapter 2. However, their algorithm lacks the
improvements which we have introduced in algorithm C of Chapter 2.

1.4.  Models 

Distributed algorithms for three different models of data communication networks
are presented in this dissertation. The three models are: the synchronous complete
network, the asynchronous complete network, and the unidirectional strongly
connected network. The three models are based on the traditional message-passing
model of data communication networks [Bur80, Lyn81, Seg83]. In this section we
first present the traditional model and then discuss the variations used in the
dissertation.


A computer communication network is a set of n processors (nodes) connected
by a set, E, of bidirectional communication lines (links). Each link connects
a distinct pair of processors. Each processor has a set of input ports and output
ports. Each communication line is modeled by connecting an <output; input> pair
of ports of one processor to an <input; output> pair of ports of another processor.
The following assumptions are made:

1.  Associated  with  each  node  is  a  unique  identifier  number  (id).  We  assume 
that  every  id  can  be  written  in,  at  most,  O  (log  n  )1,2  bits. 

2.  Within  each  processor,  each  input  and  each  output  port  has  a  unique  port  id 
which  is  known  to  the  processor.  Thus  each  port  can  be  uniquely  identified 
by  its  processor. 

3.  Initially  (before  the  algorithm  starts),  aside  from  its  port  ids  each  processor 
knows  nothing  about  the  network.  In  particular,  the  ids  of  processors  con¬ 
nected  on  the  other  side  of  each  link  are  not  known  to  the  processor.  More¬ 
over,  processors  initially  have  no  global  knowledge  such  as  the  network 
topology  or  the  total  number  of  nodes. 

4. The communication lines are reliable. Messages transmitted over the
communication lines incur an unpredictable, but finite, delay and arrive at the
input port in the order sent. Queueing delays are included in the message
delay.

1- A function of n, T(n), is O(F(n)) ("is oh F(n)") if there are positive constants
c and n_0 such that for n > n_0, we have T(n) ≤ c F(n).

2- Unless otherwise specified all logarithms are to the base 2. Note that O(log n)
does not depend on the base of the logarithm since log_a n = c log_b n, where
c = log_a b.

5.  All  messages  received  at  a  node  are  stamped  with  the  identification  of  the 
port  (link)  through  which  they  arrived.  Messages  from  all  input  ports  are 
transferred  into  a  central  queue.  The  processor  receiving  the  messages 
processes  them  one  at  a  time  in  the  order  that  they  arrive  at  the  central  queue. 

6. The processing time of a message is negligible compared to its communication
delay.
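
Assumptions 2, 3 and 5 can be illustrated with a small data-structure sketch. This is only an illustration under stated assumptions: the class name `Node` and its methods are hypothetical and do not appear in the dissertation. A node knows its own id and its local port ids, and every arriving message is stamped with its arrival port and placed in one central first-in first-out queue.

```python
from collections import deque


class Node:
    """Hypothetical sketch of one processor in the traditional model."""

    def __init__(self, node_id, num_ports):
        self.node_id = node_id
        # Only local port ids are known; neighbor ids are unknown initially.
        self.ports = list(range(num_ports))
        # Single central queue shared by all input ports (assumption 5).
        self.queue = deque()

    def deliver(self, port, message):
        """Called when a message arrives; stamp it with its arrival port."""
        if port not in self.ports:
            raise ValueError("unknown port")
        self.queue.append((port, message))

    def process_next(self):
        """Process messages one at a time, in central-queue arrival order."""
        return self.queue.popleft() if self.queue else None


node = Node(node_id=7, num_ports=3)
node.deliver(2, "hello")
node.deliver(0, "world")
first = node.process_next()   # (2, "hello")
second = node.process_next()  # (0, "world")
```

The queue preserves arrival order across all ports, which is the property the later algorithms rely on.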

Several variations on the above model are possible. In particular, we consider
the following variations:

1.  Communication  lines  can  be  either  unidirectional  or  bidirectional. 

2. The underlying topology, which is known to the algorithm designer, can be
either an arbitrary strongly connected unidirectional network topology or a
bidirectional complete network topology.

3.  The  communication  mode  can  be  either  synchronous,  or  asynchronous. 

Unlike a bidirectional communication link, a unidirectional communication
link from node v to node u can carry messages only from v to u. A unidirectional
network is a network in which some or all the links are unidirectional. In such
networks a communication line is modeled by a connection of an <output> port of one
processor to an <input> port of another processor. A unidirectional network is
called strongly connected if there is a directed path from every node in the network
to every other node. A unidirectional link from v to u is called an outgoing link of
v and an incoming link of u.
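
The definition of strong connectivity can be checked centrally with two reachability searches. The sketch below is an illustration only, not one of the distributed algorithms of this dissertation, and the function names `reachable` and `is_strongly_connected` are hypothetical. It uses the standard fact that a directed graph is strongly connected exactly when one chosen node reaches every node along outgoing links and is reached by every node (that is, reaches every node along reversed links).

```python
def reachable(adj, start):
    """Return the set of nodes reachable from start using adjacency dict adj."""
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for u in adj.get(v, ()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def is_strongly_connected(nodes, links):
    """links is an iterable of (v, u) pairs: a unidirectional link from v to u."""
    nodes = set(nodes)
    if not nodes:
        return True
    out_adj, in_adj = {}, {}
    for v, u in links:
        out_adj.setdefault(v, []).append(u)  # outgoing link of v
        in_adj.setdefault(u, []).append(v)   # incoming link of u
    root = next(iter(nodes))
    return reachable(out_adj, root) >= nodes and reachable(in_adj, root) >= nodes


ring = [(0, 1), (1, 2), (2, 0)]   # unidirectional ring
chain = [(0, 1), (1, 2)]          # no path back to node 0
ring_ok = is_strongly_connected(range(3), ring)
chain_ok = is_strongly_connected(range(3), chain)
```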

In practice, networks in which communication is unidirectional appear in a
few forms. For example, due to antenna power differences in packet radio networks,
the hearing matrix is not symmetric, i.e., some links are unidirectional [Kah78].
Examples of point-to-point unidirectional networks are found in fiber optic networks
and microwave communication networks.

Two topologies are considered in this work: the complete network and the
strongly connected unidirectional network. In a complete network of n nodes, every
node is connected by n-1 bidirectional communication links to all other nodes. All
the links incident to a given node on which no message was sent or received are
indistinguishable to this node.

Two modes of communication are considered here, synchronous and
asynchronous. In the synchronous mode, a global clock is connected to all the nodes in
the network. The time interval between two consecutive pulses of the clock is a
round. At the beginning of each round, each node decides, according to its state,
what messages to send and on which links to send them. Each node then receives
any messages sent to it in this round and uses the received messages and its state to
decide on its next state. Spontaneously awakened nodes start a distributed algorithm
by entering an initial state and then waiting for the beginning of the next round.
Further variations on the assumptions in the synchronous mode are possible and are
discussed in more detail in Chapter 3. In the asynchronous mode there is no global
clock, and messages incur arbitrary but finite delay.

In Table 1.1 the different combinations of parameters used in this work are
summarized. In particular, we consider

(1) synchronous complete bidirectional networks,

(2) asynchronous complete bidirectional networks and

(3) asynchronous strongly connected unidirectional networks.

1.5. Performance Measures


The interesting performance measures of distributed algorithms are the
amount of communication and the amount of time required in the execution
of the algorithm. Hence, two measures of performance are used to analyze
distributed algorithms: communication complexity and time complexity.

Communication complexity is the total number of messages sent, in the worst
case, by the algorithm. Each message is assumed to contain no more than O(log n)
bits, which prevents an algorithm from sending fewer but larger messages. O(log n)
is the number of bits we assume is required to represent one node id.


Alternatively, we also use bit complexity as a measure of communication
complexity. The bit complexity of an algorithm is the total number of bits of all
messages transmitted by the algorithm, in the worst case.

Time complexity is the worst case length of the time interval from the first to
the last message transmission due to the algorithm. As was stated before, processing
time is assumed to be zero and therefore we do not consider processing time
complexity in this study.

The time complexity of a synchronous algorithm is well defined by the above
definition; however, the time complexity of an asynchronous algorithm is unbounded
in the worst case, since messages can incur an arbitrary, but finite, delay. To analyze
the time complexity of asynchronous algorithms, our assumptions must be modified.
Thus, in the asynchronous mode of communication, and only for the purpose of time
complexity analysis, we assume that the transmission of a message over any link
incurs at most one time unit of delay. In arguing the time complexity, we shall allow a
message to traverse a link in any fraction of the time unit. This enables us to
construct scenarios in which some messages are delivered as fast as we want relative to
other messages.

1.6.  Previous  Work 

Distributed algorithms as a solution to the problem of distributively electing
a leader first appeared in 1977 in two different topologies. One group of researchers
[Dal77, Spi77, Gal77] tackled the problem for arbitrary-topology bidirectional
networks, while another group [LeL77, Cha79] treated the problem for both
bidirectional and unidirectional rings (see footnote 1).


1.6.1.  Election  in  Ring  Networks 

In [LeL77], Le Lann faced the problem of electing a leader in a unidirectional
ring while designing a scheme for mutual exclusion in a distributed environment.
Le Lann studied a system of n controllers taking turns allocating resources to
users. The controllers, each of which has a unique id, are connected in a virtual
unidirectional ring. Mutual exclusion among the controllers is achieved by circulating a
single token around the ring. The problem is, then, to design an algorithm which
will elect one (unique) controller to initiate a new token in case the previous token is
lost. Le Lann proposed an O(n^2) message algorithm for the problem.

Following Le Lann's work, Chang and Roberts [Cha79] proposed an algorithm
with an average message complexity of O(n log n); however, the worst case
complexity is O(n^2) messages. Subsequently, Hirschberg and Sinclair [Hir80] gave
an election algorithm for bidirectional rings that uses O(n log n) messages in the
worst case. Burns [Bur80] proved a lower bound of Ω(n log n) (see footnote 2)
messages for election in bidirectional rings. In [Hir80], Hirschberg and Sinclair
conjectured that, for unidirectional rings, Ω(n^2) messages is the lower bound.
However, Dolev et al. [Dol82] and Peterson [Pet82] both disproved the conjecture by
presenting a sequence of unidirectional algorithms, each improving on the other.
The last improvement, given in [Dol82], obtained a 1.356 n log n message algorithm.
A lower bound of 0.693 n log n messages for unidirectional rings was obtained by
Pachl et al. [Pac82].

1- A ring topology is a circular arrangement of processors in which every processor
is connected by a link to each of its two neighbors.

2- A function of n, T(n), is Ω(F(n)) ("is omega F(n)") if there exists a positive
constant c such that T(n) ≥ c F(n) infinitely often (for an infinite number of values of
n). This definition, taken from [Aho83], is not symmetric to the big-oh notation,
because an algorithm can be efficient on many but not all values of n. However, in
this work the symmetric definition would be sufficient, i.e., there exist positive
constants c and n_0 such that T(n) ≥ c F(n) for all n > n_0.

Recently, Frederickson and Lynch [Fre84] addressed the problem of electing
a leader in synchronous rings. They showed that, for synchronous networks, one
should distinguish between two types of algorithms: general, in which nodes may
perform any computation on the values of their ids; and comparison, in which the
values of ids can be used only for comparison with each other. On the one hand,
they gave an Ω(n log n) message lower bound for comparison algorithms. On the
other hand, they presented an O(n) general algorithm (which was also independently
discovered by P. Vitanyi [Vit84]), thus showing that general algorithms are strictly
more powerful than comparison algorithms in synchronous rings. However, the time
complexity of the general algorithm is exponential in the value of the smallest id
around the ring. Noticing the discrepancy between the time complexity of the linear
message complexity general algorithm and the O(n log n) message comparison
algorithm, they proved a relation between three parameters of general
election algorithms in a synchronous ring: (1) the upper bound on the
time complexity, (2) the lower bound on the message complexity, and (3) the size of
T, the set of ids from which ids for nodes around the ring are selected. Specifically,
they proved that if the time complexity of a general algorithm is upper bounded by
d, then there exists T such that the message complexity of the algorithm is lower
bounded by Ω(n log n). The relation they proved is such that the size of T, |T|,
grows very fast with both d and n. More recently, Gafni [Gaf85] presented an
O(n log* n) message, O(G^{-1}(G(n)) |T|) time general algorithm for election in
synchronous rings (G(n) = log* n; see footnote 1), thus improving on the time
complexity of Frederickson and Lynch's algorithm.

1- log* n is the minimum number of times we have to take log from n to get a
number smaller than 1. Alternatively, it is the inverse of the function F(n), which is
defined as follows: F(0) = 1, F(i) = 2^F(i-1) (i.e., G(n) = log* n = F^{-1}(n)).
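
To make the growth of log* n concrete, the following sketch computes the tower function F from the footnote and an inverse of it. It assumes one particular reading of "the inverse of F", namely the smallest i with F(i) ≥ n; other conventions (including the "repeated log" phrasing above) can differ from this by one at the boundaries. The names `tower` and `log_star` are hypothetical.

```python
def tower(i):
    """F(0) = 1, F(i) = 2 ** F(i - 1)."""
    value = 1
    for _ in range(i):
        value = 2 ** value
    return value


def log_star(n):
    """Smallest i such that tower(i) >= n (an assumed convention)."""
    i = 0
    while tower(i) < n:
        i += 1
    return i


# tower grows as 1, 2, 4, 16, 65536, ... so log_star stays tiny
# for any practical network size.
values = [log_star(n) for n in (1, 2, 4, 16, 65536)]
```

Under this convention an O(n log* n) bound is, for all practical n, within a small constant factor of linear.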


1.6.2.  Election  in  General  Networks 


Fault tolerance and broadcast protocols motivated the development of
distributed election algorithms in arbitrary-topology bidirectional networks. In fault
tolerance applications, a central controller of a distributed system sometimes needs to be
chosen to replace a faulty one. For example, in the TYMNET public network
[Tym71, Rin77], at any given time there is one node, called SAM (Supervisor in
Active Mode), which allocates virtual routes for new sessions between users
distributed in the network. Periodically, all the nodes in the network update SAM with
the delays on their incident links. SAM uses this information to issue any newly
requested virtual routes while keeping the overall delay of the network down. When
SAM goes down, a new node is elected to make the routing decisions. Other
examples of applications of the election algorithm are situations where a faulty primary
site in a replicated distributed file system [Als76] or a faulty lock coordinator in a
distributed data base system [Men78] needs to be replaced.

Distributed algorithms for spanning tree construction, rather than distributed
election, were motivated by the design of efficient broadcast protocols. However,
any algorithm to construct a spanning tree can be transformed into an election
algorithm by sending O(n) more messages as described below. Hence, any efficient
algorithm for spanning tree construction is also an efficient election algorithm.

In a distributed spanning tree algorithm, every node marks some of its
incident links; the collection of marked links constitutes a spanning tree. Once a
spanning tree is constructed, an election phase is implemented by associating
directions with each link in the tree. Every node, for which all incident links but one have
already been directed, directs that one link outward. This process starts at the leaves
and terminates when two nodes simultaneously try to direct the link connecting them
in opposite directions. The highest id node of these two is then elected as the leader.
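
The election phase on a spanning tree can be sketched as a centralized, round-by-round simulation. This is an illustration only: the function name `elect_on_tree` is hypothetical, and the round-based scheduling is an assumption made for simplicity. In particular, when all leaves are removed in lockstep a single middle node may remain instead of a pair; the sketch then elects that node, a case the description above does not spell out.

```python
def elect_on_tree(tree):
    """tree maps each node id to the set of its neighbors in the spanning tree.

    Leaves direct their last undirected link outward and drop out; the
    process repeats until one link (two nodes) or one node remains.
    """
    remaining = {v: set(nbrs) for v, nbrs in tree.items()}
    while len(remaining) > 2:
        leaves = [v for v, nbrs in remaining.items() if len(nbrs) == 1]
        for v in leaves:
            (u,) = remaining[v]          # the only undirected link of v
            remaining[u].discard(v)      # link v-u is now directed v -> u
            del remaining[v]
    # Two nodes left: both try to direct the same link; highest id wins.
    # One node left: assumed here to become the leader.
    return max(remaining)


path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
star = {9: {1, 2, 3}, 1: {9}, 2: {9}, 3: {9}}
leader_on_path = elect_on_tree(path)   # nodes 2 and 3 meet; 3 is elected
leader_on_star = elect_on_tree(star)   # only the hub 9 remains
```

Each tree link is directed once, which is consistent with the O(n) additional messages mentioned above.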


In his Ph.D. dissertation, Dalal [Dal77] addressed the problem of
distributively constructing a spanning tree in the design of efficient network broadcast
protocols. One simple way to broadcast a message in a network is to send a copy of the
message over each link; however, in large networks this mechanism could be too
costly, in particular when considering frequent broadcasts by different nodes. To
reduce the message complexity of broadcasting, Dalal suggested first defining a
minimum weight spanning tree (MST) on the network (the link weights being the
cost of transmitting one message over each link) and then broadcasting by sending
one copy of the message over each link of the MST. Thus, for the price of one MST
construction, Dalal reduced the message complexity of broadcasting from O(|E|)
to O(n). In his dissertation, Dalal gives a distributed algorithm for MST
construction. The message complexity of the algorithm was not analyzed but is believed to be
worse than that of more recent algorithms.

Spira [Spi77] followed up on the algorithm of Dalal and obtained a
distributed MST algorithm with average message complexity of O(|E| + n log n). In
[Gal83], Gallager, Humblet and Spira have further improved on the algorithms of
[Dal77, Spi77] to obtain an O(|E| + n log n) worst case message complexity
algorithm.

In  [Gal83],  every  awakened  node  starts  to  construct  a  subtree  (fragment)  of 
the  MST  by  iteratively  selecting  the  minimum  weight  edge  adjacent  to  its  already 
constructed  fragment.  A  variable,  called  level,  is  associated  with  each  fragment. 
Whenever  two  growing  fragments  meet,  the  level  variables  are  used  to  economically 
combine  them  into  one  fragment.  The  algorithm  terminates  when  the  whole  network 
is  spanned  by  one  fragment. 

The time complexity of the algorithm in [Gal83] is O(n log n). Recently,
Gafni [Gaf85] further improved the algorithm of Gallager et al. to reduce its time
complexity to O(n log* n) by modifying the definition of the level variables and the
mechanism by which fragments merge. The algorithm of Gafni has thus established
the best known time upper bound for message optimal election algorithms in
asynchronous general networks.

While the main goal in [Dal77, Spi77, Gal83] was to construct an MST, an
upper bound of O(|E| + n log n) messages on the election problem was already
established by Gallager in 1977 [Gal77], when he presented an election algorithm
with this complexity. In [Gal77], each spontaneously awakened node starts a depth
first search process which tries to traverse all the links of the network. When two
traversing processes meet, the one which has already visited more nodes kills the
other and continues. The depth first search process which survives all the others
elects its initiating node as the leader of the network.

Ω(|E|) is clearly a lower bound for election in asynchronous general
networks (also, in rings) since no algorithm may terminate before sending at least one
message over each link; otherwise, an untraversed link could be the only link
connecting two parts of the network, each holding a separate election. Following
[Bur80], Ω(n log n) is also a lower bound. Thus, Θ(|E| + n log n) (see footnote 1)
is both the upper and lower bound on the message complexity of election in
asynchronous general networks.


1- A function of n, T(n), is Θ(F(n)) ("is theta F(n)") if it is both O(F(n)) and
Ω(F(n)), i.e., there exist positive constants c_1, c_2 and n_0 such that
c_1 F(n) ≤ T(n) ≤ c_2 F(n) for all n > n_0.


1.6.3.  Election  in  Complete  Networks 

It was shown by Korach, Moran and Zaks [Kor84] that in complete
networks, unlike in rings and general networks, the Ω(|E|) lower bound does not hold.
The reason is that, in a complete network, an election algorithm can be
terminated once some node has communicated with all its neighbors. Subsequently,
Korach, Moran and Zaks presented a 5 n log n + O(n) message, O(n log n) time
algorithm and an Ω(n log n) message lower bound for election in asynchronous
complete networks. Their algorithm is essentially the same as the MST algorithm of
Gallager et al. [Gal83], combined with the observation that termination detection
is simpler in complete networks.

1.6.4.  Election  in  Unidirectional  Networks 

Prior to the algorithm given in this dissertation, no algorithm specifically
designed for election in strongly connected unidirectional networks had been
reported. However, two bidirectional distributed algorithms [Seg83, Gal76] can
easily be turned into unidirectional election algorithms. In [Seg83], Segall presents
a connectivity checking algorithm upon whose termination every node knows the ids
of all the other nodes connected to it. The shortest path algorithm in [Gal76]
exhibits the same property when it terminates. The communication complexity of the
two algorithms is O(n·|E|·log n) bits, and each node is assumed to have
O(n log n) bits of memory.

The unidirectional variation of the two algorithms proceeds in two phases: In
the first phase, every node acquires the ids of its incoming neighbors; in the second,
it acquires the ids of all the other nodes in the network. The details of this algorithm
are postponed to the introduction of Chapter 5. The communication complexity of
the algorithm is O(|E|^2 log n) bits; however, assuming that messages sent over one
link are received in the order transmitted, the communication complexity can be
reduced to O(n·|E|·log n) bits.

In looking for a lower bound on the problem of electing a leader in
unidirectional networks, Gafni and Korfhage [Gaf84] designed an election algorithm for
unidirectional Eulerian networks. The message complexity of their algorithm is
O(|E| log n).

1.7.  Dissertation  Overview 

In  this  dissertation  we  will  present  distributed  algorithms  for  three  different 
models:  complete  synchronous  networks,  complete  asynchronous  networks  and 
asynchronous  strongly-connected  unidirectional  networks  (refer  to  Table  1.1). 

Five algorithms for election in complete networks are presented in Chapter 2
(see Table 2.1). In Section 2.3, we present a 3 n log n message, O(log n) time
synchronous algorithm.

When the synchronous algorithm is applied to an asynchronous network,
its time complexity degrades to O(n) and its message complexity to 5 n log n
(Section 2.4). The asynchronous algorithm is an improvement over the considerably
more complicated algorithm in [Kor84], whose time complexity is O(n log n) and
message complexity is 5 n log n + O(n).

In an effort to reduce the message complexity of the asynchronous algorithm
to 2 n log n while maintaining its linear time complexity, we present a sequence of
three more asynchronous algorithms (A, B and C, Section 2.5). The first two
algorithms present tradeoffs between time and message complexities. Algorithm A
(which was also derived independently in [Hum84]) has O(n) time complexity and
2.773 n log n message complexity. Algorithm B has O(n log n) time complexity
but 2 n log n message complexity. Analyzing the communication and time
complexities of the two algorithms, we derive a third algorithm, algorithm C, whose time
complexity is O(n) and communication complexity is 2 n log n, an improvement on
the O(n log n) time and 2 n log n message algorithm of [Pet84]. It remains an open
question whether a sublinear-time, message-optimal (O(n log n) messages)
asynchronous algorithm exists. We conjecture that such an algorithm does not exist, i.e.,
that the time complexity of any asynchronous message-optimal election algorithm is
Ω(n).

In Chapter 3 we prove two lower bounds on the problem of electing a leader
in synchronous complete networks. First, we prove a lower bound of Ω(n log n) on
the message complexity. Second, we prove that any message-optimal synchronous
algorithm requires Ω(log n) time. In proving these bounds, we do not restrict the
type of operations performed by nodes. The bounds thus apply to general algorithms
and not just to comparison-based algorithms. This proves that the synchronous
algorithm of Chapter 2 is optimal.

We prove that the message complexity of any election algorithm, comparison
or general, in a complete synchronous or asynchronous network is Θ(n log n). This
proves that, for the problem of election in complete networks (unlike rings), general
algorithms are not more powerful than comparison algorithms. The difference
between synchronous rings and synchronous complete networks stems from the fact
that in a ring all nodes can be distributively awakened with n messages, whereas in
the complete network the awakening problem is as hard as the election problem,
requiring Ω(n log n) messages. If all the nodes of a complete network could be
awakened with n messages, then a general algorithm could take advantage of the
synchronous mode of communication to elect a leader in a linear number of
messages by using the principles suggested in [Gaf85].


We also prove an Ω(log n) lower bound on the time complexity of any
message-optimal election algorithm in synchronous complete networks.
Specifically, we show that, if the time complexity of an election algorithm (whether
comparison or general) is upper bounded by (1/2) log_c n rounds, then its message
complexity is lower bounded by Ω(((c-1)/(2 log c)) n log n).

In Chapter 4 three algorithms for traversal of unidirectional networks,
Traversal-1, -2 and -3, are presented. Traversal-1 is simple but inefficient. In many
networks, the process of Traversal-1 hops over an exponential number of links
before terminating. Traversal-2, which is based on the centralized depth first search
algorithm, makes at most O(n·|E|) hops on any network. Furthermore, we show
that, in general, Ω(n·|E|) is a lower bound on the number of hops.

In both Traversal-1 and -2, O(log n) bits of memory are required at each node,
and that same amount is carried along with the traversing process (i.e., message size
is O(log n) bits). In some applications, such as VLSI, memory size and message
length are restricted, and a question then arises whether a unidirectional traversal
could be implemented using only a constant number of bits in every node and on the
traversing process (i.e., in a unidirectional network of finite automata). In Chapter 4
we answer the question in the affirmative by presenting Traversal-3, a traversal
algorithm for unidirectional networks of finite automata. Traversal-3 makes at most
O(n·|E| + n log n) hops, which is optimal in the worst case (dense networks, in
which |E| = Ω(n log n)).

Both Traversal-2 and -3 yield two spanning trees, both rooted at the root of
the traversal, one an incoming tree and the other an outgoing tree. The structure
defined by the union of these two trees is shown to be useful in various applications
such as broadcasting, routing and termination detection.

In Chapter 5 we present a distributed algorithm for election in strongly
connected unidirectional networks. The algorithm distinguishes a single processor
from all other processors in the network. The algorithm requires O(log n) bits of
memory in each processor, and its communication complexity is
O(n·|E| + n^2 log n) bits.


As with Traversal-2 and -3, the election algorithm yields two directed
spanning trees, both rooted at the elected leader, one an incoming tree and the other an
outgoing tree. The algorithm is an improvement on the connectivity checking
algorithm of Segall [Seg83] and the shortest path algorithm of Gallager [Gal76], both of
which can easily be modified to work on a unidirectional network (see Section
1.6.4). The communication complexity of the two algorithms is O(n·|E|·log n)
bits, and each node is assumed to have O(n log n) bits of memory. Furthermore,
unlike our algorithm, neither Segall's nor Gallager's algorithm provides the
spanning trees.

CHAPTER 2.


ALGORITHMS  FOR  ELECTION  IN  COMPLETE  NETWORKS 

In  this  and  the  next  chapter  we  address  the  problem  of  electing  a  leader  in 
complete  networks.  In  this  chapter  five  election  algorithms  for  synchronous  and 
asynchronous  complete  networks  are  presented  (see  Table  2.1),  while  tight  lower 
bounds  on  the  message  and  time  complexity  for  the  synchronous  case  are  given  in 
Chapter  3. 

The message complexity of each of the five algorithms presented in this chapter is
O(n log n), where n is the total number of nodes in the network. However, the time
complexity of the synchronous algorithm, O(log n), is considerably better than
O(n), the time complexity of the asynchronous algorithms, thus suggesting that the
synchronous mode of communication is more powerful than the asynchronous mode.


Communication   Message          Time
Mode            Complexity       Complexity
-------------   --------------   ----------
Synchronous     3 n log n        O(log n)
Asynchronous    6 n log n        O(n)
Asynchronous    2.77 n log n     O(n)
Asynchronous    2 n log n        O(n log n)
Asynchronous    2 n log n        O(n)

Table 2.1


2.1.  Introduction 

In a complete network every node is connected to all the other nodes. Before
the algorithm starts, no node has any information on any of the other nodes. Thus,
the incident links of a node, on which no message was sent or received, are
indistinguishable.

Consider the following straightforward election algorithm in complete
networks. Every initiator starts the algorithm by sending messages, containing its id, to
all its neighbors. All the initiators then elect the highest id initiator as the leader.
The time complexity of this algorithm is two time units, and its worst case message
complexity is O(n^2) (the message complexity is k·n, where k is the number of
initiators). In Section 2.4 the message complexity of this simple algorithm is reduced to
O(n log n) by slowing down the rate at which initiators send messages to their
neighbors to one message at a time. However, the reduced rate increases the time
complexity of the algorithm to O(n). In Section 2.2 we use the synchronous model
of communication to design an O(log n) time, message-optimal algorithm. This is
done by carefully selecting a dynamic rate at which initiators send messages to their
neighbors.

2.2.  The  Synchronous  Algorithm 

In this section we present a synchronous algorithm that uses 2 log n rounds
and 3 n log n messages. In the next chapter we will show that this algorithm is
message optimal and is as fast as a message-optimal algorithm can be.

2.2.1. Description of the Algorithm


The algorithm is initiated by any subset of nodes, each of which is a
candidate for leadership. Each candidate tries to capture all other nodes by sending
messages on all the links incident to it. The candidate that has succeeded in capturing all
its neighbors elects itself as the leader. To guarantee that only one node succeeds,
all candidates but one are killed.

To simplify the algorithm, every initiator node spawns two processes, the
candidate process and the ordinary process. The two processes are connected to
each other by a bidirectional logical link which behaves like a physical link. A node
awakened by receiving a message of the algorithm spawns only an ordinary process.
Candidate processes communicate only with ordinary processes and vice versa.
Thus, the communication topology is a complete bipartite graph, with the
candidate processes on one side and the n ordinary processes on the other. Henceforth,
the term candidate will be applied interchangeably to both the process and its
initiating node. All messages received by a node are tagged according to the type of
their sending process. Messages received from candidate processes are forwarded to
the ordinary process. Messages received from ordinary processes are forwarded to
the candidate process.

At every candidate the algorithm proceeds in levels. Every live candidate at
level i, i ≥ 0, tries to capture $2^i$ new ordinary processes by sending them messages
containing its level and id. If in the second round of level i the candidate receives
acknowledgments from all the ordinary processes it tries to capture, it proceeds as a
candidate to the next level. On the other hand, if not all the acknowledgments are
received, the process (and hence the node owning it) is eliminated from candidacy.


Every candidate has a variable called level which is incremented by one
every two rounds. Every ordinary process has an owner-level and an owner-id
variable which contain the level and id of the highest-level candidate the process has
received a message from (level ties are resolved by selecting the highest id). In
every round, every ordinary process first increases its owner-level by one, to reflect
the owner's actual level, and then inspects the newly received messages to update its
owner-level and owner-id if necessary. If an update occurred, the ordinary process
acknowledges its new owner.

A formal description of the algorithm is given in Figure 2.1. E is the set of
edges incident to a candidate process. Every candidate maintains a list of edges,
called untraversed, which it has not yet traversed in any direction.

2.2.2.  Time  and  Message  Complexities 

Let p be the largest id of a candidate from the set of oldest candidates (i.e.,
whose level is the largest). We observe the following three facts:

Fact 1: The owner-level of every node strictly increases from one round
to the next.

Fact 2: At most $n/2^{i-1}$ candidates reach level i, 1 ≤ i ≤ log n.

Fact 3: 2 log n rounds after it has started the algorithm, candidate p has
captured all the nodes and is elected as the network leader.

Fact 1 follows immediately from the algorithm for ordinary node processes.
Fact 3 holds because all the messages of p get acknowledged, and once a node has
acknowledged p, it will not acknowledge any other message. Fact 2 follows from

Candidate program:

    untraversed <— E ;
    level <— -1 ;

    Each round do:
        level <— level + 1 ;
        If level is even
        Then
            If untraversed is empty
            Then
                ELECTED, STOP
            Else
                K <— Minimum ( 2^(level/2), |untraversed| ) ;
                Send (level, id) over K links from untraversed, and
                remove these links from untraversed ;
        Else /* level is odd */
            Receive all acknowledgment type messages
            If received less than K acknowledgments
            Then
                Stop /* Not a candidate any more */
    End each round.


Ordinary program:

    L* <— nil ;
    owner-level <— -1 ;
    owner-id <— id ;

    Each round do:
        Send an acknowledgment over L* ;
        owner-level <— owner-level + 1 ;
        Receive all candidate messages {(level, id) over link L} ;
        Let (level*, id*) be the lexicographically largest
            (level, id) candidate message, and
            L* the link over which it arrived ;
        If (level*, id*) > (owner-level, owner-id)
        Then
            (owner-level, owner-id) <— (level*, id*) ;
        Else
            L* <— nil ;
    End each round.

Figure 2.1: The Synchronous Algorithm


Fact 1 and the observation that every ordinary node acknowledges at most one
message in which the level is i, 0 ≤ i ≤ log n, i.e., the sets of $2^{i-1}$ nodes that
are captured by each candidate that has reached level i are disjoint.

Following Fact 3, the time complexity of the algorithm is 2 log n. Since every
node sends at most one acknowledgment to a candidate in level i, the total number
of acknowledgments is n log n, each of length O(1) bits. Due to Fact 2, the total
number of candidate messages is

$$\sum_{i=1}^{\log n} \frac{n}{2^{i-1}} \cdot 2^{i} = 2n \log n,$$

each message containing log n + log log n bits. The total communication complexity
is thus 3·n·log n messages.
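
The two programs of Figure 2.1 can be exercised with a small round-based simulation. The sketch below is illustrative only and rests on assumptions that are not in the text: node ids are 0..n-1, every initiator starts in round 0, a candidate only has to capture the other n-1 nodes (its own ordinary process is treated as already owned), and links are traversed in a random order. The function name `synchronous_election` and the state layout are hypothetical.

```python
import random


def synchronous_election(n, initiators, seed=0):
    """Round-based sketch of Figure 2.1 on a complete network of n nodes.

    Assumptions (not from the source): ids are 0..n-1, all initiators start in
    round 0, and a candidate only needs to capture the other n-1 nodes.
    Returns (leader id, candidate level at election, total messages sent).
    """
    if not initiators:
        raise ValueError("at least one initiator is required")
    rng = random.Random(seed)
    cand = {}
    for c in initiators:
        links = [v for v in range(n) if v != c]
        rng.shuffle(links)  # links are traversed in an arbitrary order
        cand[c] = {"untraversed": links, "level": -1, "K": 0}
    # Ordinary state per node: [owner_level, owner_id, L_star]. Initiators start
    # with their own id; other nodes are spawned on their first message.
    ordinary = {c: [-1, c, None] for c in initiators}
    messages = 0
    while True:
        # Send phase: ordinary processes acknowledge over L*, then age the owner.
        acks = {}
        for st in ordinary.values():
            if st[2] is not None:
                acks[st[2]] = acks.get(st[2], 0) + 1
                messages += 1
            st[0] += 1
        # Send phase: candidates on even levels send capture messages.
        inbox = {}
        for c, st in cand.items():
            st["level"] += 1
            if st["level"] % 2 == 0:
                if not st["untraversed"]:
                    return c, st["level"], messages  # ELECTED
                st["K"] = min(2 ** (st["level"] // 2), len(st["untraversed"]))
                for _ in range(st["K"]):
                    v = st["untraversed"].pop()
                    inbox.setdefault(v, []).append((st["level"], c))
                    messages += 1
        # Receive phase: candidates on odd levels count acknowledgments.
        for c in list(cand):
            if cand[c]["level"] % 2 == 1 and acks.get(c, 0) < cand[c]["K"]:
                del cand[c]  # not a candidate any more
        # Receive phase: ordinary processes keep the lexicographically largest pair.
        for v in set(ordinary) | set(inbox):
            st = ordinary.setdefault(v, [-1, v, None])
            best = max(inbox.get(v, []), default=None)
            if best is not None and best > (st[0], st[1]):
                st[0], st[1], st[2] = best[0], best[1], best[1]
            else:
                st[2] = None


leader, level, messages = synchronous_election(8, [1, 4, 6])
print(leader, level, messages)
```

Under these assumptions the surviving candidate is the initiator with the largest id, and it is elected when its level counter reaches 2 log n for n a power of two, in line with Fact 3.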

A continuum of algorithms can be devised to close the gap between the
trivial O(1) time, O(n^2) messages algorithm and the O(log n) time, 3n log n
messages algorithm. Each algorithm in the continuum is the same as the above, except
that a candidate in level i is trying to capture $c^i$ neighbors, 2 ≤ c ≤ n. The time
complexity of the algorithm is $2 \log_c n$, and its message complexity is
$2 c\, n \log_c n$, thus proving that the lower bounds that will be presented in
Chapter 3 (Theorem 3.2) are tight.
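
To get a feel for the trade-off, the two expressions quoted above can be tabulated for a few values of c. This is only a numeric illustration of the stated formulas; the function name `continuum_costs` and the choice n = 1024 are assumptions made for the example.

```python
import math


def continuum_costs(n, c):
    """Return (time, messages) = (2*log_c n, 2*c*n*log_c n) as quoted in the text."""
    levels = math.log2(n) / math.log2(c)  # log_c n
    return 2 * levels, 2 * c * n * levels


for c in (2, 4, 32, 1024):
    rounds, messages = continuum_costs(1024, c)
    print(f"c={c:5d}  time={rounds:5.1f}  messages={messages:12.0f}")
```

Small c gives the slow, message-frugal end of the continuum; c = n gives the constant-time end with a message count of order n^2.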

2.3.  Asynchronizing  the  synchronous  algorithm 

In this section we apply the synchronous algorithm to an asynchronous complete
network. To maintain the O(n log n) message complexity in the asynchronous
communication mode, we are forced to increase the time complexity to O(n). The
increase in the time complexity seems unavoidable and it remains open whether a
sublinear time, message-optimal asynchronous algorithm exists.

There are two basic differences between the asynchronous and synchronous
modes of communication. First, in the asynchronous mode there is no global clock,
and second, messages incur an arbitrary but finite delay. The arbitrary delay of
messages (and not the absence of the clock) is the source of the increase in the time
complexity of the algorithm. Essentially the synchronous algorithm could work without
a global clock, if all messages incurred exactly the same delay (in which the queueing
and processing times are also included). In such a model we assume that, if two
nodes, P and Q, send messages at the same time to the same two other nodes, u and
v, and the message of P arrives at u before the message of Q, then the message of
P will also arrive first at v.

To see that a straightforward application of the synchronous algorithm in the
asynchronous model will not work, consider the following situation: There are two
competing candidates, C_1 and C_2, each of which has already successfully captured
one node, and both proceed to capture the nodes v and u at the same time. A message
of C_1 was the first to arrive at v, which is then captured by C_1, while a message of
C_2 was first to arrive at u. Following the rules of the synchronous algorithm, v
positively acknowledges only C_1 and u positively acknowledges only C_2. Thus, both
candidates are killed since neither had all of its messages positively acknowledged.
In the following section we present an algorithm which overcomes this and similar
problems in the asynchronous case.

2.3.1.  Description  of  the  algorithm 

As  in  the  synchronous  case,  the  asynchronous  algorithm  is  started  at  arbitrary 
times  by  an  arbitrary  set  of  nodes,  each  of  which  is  a  candidate  for  leadership.  Each 
candidate  tries  to  capture  the  network  by  sending  messages  on  all  its  incident  links. 
To  guarantee  that  only  one  candidate  is  elected,  all  candidates  but  one  are  killed. 
The  candidate  that  has  succeeded  in  capturing  all  its  neighbors  is  elected  as  the 
leader  of  the  network. 

The level variable of a candidate is a function of the number of nodes that the
candidate has already captured. A candidate at level l has already successfully
captured $2^l - 1$ nodes. As in the synchronous algorithm, every captured node has
owner-level and owner-id variables, which respectively are the highest level
among the candidates from which it has received a message, and the id of one of
these candidates (which is assumed to own it). The potential-id of a captured node
is the id of a candidate which tries to capture it. All variables are initially set to nil.

Every candidate at level l, 0 ≤ l < log n, sends $2^{l+1} - 1$ messages containing
its level and id: to the $2^l - 1$ nodes it had already captured and on $2^l$ unused
incident links. Unlike the synchronous case, a candidate in this algorithm waits to
receive either a positive or a negative acknowledgment to each of these messages. If
$2^{l+1} - 1$ positive acknowledgments are received the candidate proceeds to level
l+1. On the other hand, if any of the messages was acknowledged negatively, the
candidate is killed. It then sends relinquish messages, containing its id, to all the
nodes that it had ever tried to capture, informing them of its elimination from
candidacy.

When a message <level_C, id_C> from candidate C arrives at node v whose
variables are owner-level_v, owner-id_v, and potential-id_v, we distinguish between
three cases:

1. Either <owner-level_v, owner-id_v> or <owner-level_v, potential-id_v> is
   lexicographically greater than <level_C, id_C>, in which case the message is
   acknowledged negatively.

2. owner-level_v is smaller than level_C, in which case candidate C is
   acknowledged positively, and the <level_C, id_C> pair replaces the
   <owner-level_v, owner-id_v> pair.

3. owner-level_v equals level_C and id_C is greater than both owner-id_v and
   potential-id_v. In this case id_C replaces potential-id_v and node v waits for
   one of the following two events to occur before acknowledging candidate C:

   a. A relinquish message with the same id as owner-id_v is received, in
      which case C is acknowledged positively, and the <level_C, id_C> pair
      replaces the <owner-level_v, owner-id_v> pair, or

   b. Another message whose <level, id> is lexicographically greater than
      <level_C, id_C> arrives, in which case C is acknowledged negatively.

Note  that  in  none  of  the  above  cases  does  a  node  put  on  hold  more  than  one 
message.  If  a  message  arrives  at  a  node  which  already  holds  one  message  then  the 
lexicographically  smaller  one  is  acknowledged  negatively  (case  3  b). 
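
The three cases can be written down as a small state machine for a single node. The sketch below is one possible reading and not the authors' implementation: the class `Node`, the reply tuples, and the use of -1 for a nil level are hypothetical, and the handling of a relinquish from a held (potential) candidate is an assumption, since the text specifies only case 3a.

```python
class Node:
    """Per-node state of the asynchronous algorithm; None plays the role of nil."""

    def __init__(self):
        self.owner_level = -1     # assumption: a nil level compares below level 0
        self.owner_id = None
        self.potential_id = None  # id of the single candidate put on hold

    def on_capture(self, level, cid):
        """Handle <level, cid>; return the acknowledgments it triggers."""
        replies = []
        if level > self.owner_level:
            # Case 2: a higher level wins outright; a held candidate loses (3b).
            if self.potential_id is not None:
                replies.append(("nack", self.potential_id))
                self.potential_id = None
            self.owner_level, self.owner_id = level, cid
            replies.append(("ack", cid))
        elif (level == self.owner_level and cid > self.owner_id
              and (self.potential_id is None or cid > self.potential_id)):
            # Case 3: same level, larger id; hold C and drop the smaller held one.
            if self.potential_id is not None:
                replies.append(("nack", self.potential_id))
            self.potential_id = cid
        else:
            # Case 1: the owner pair or the held pair is lexicographically greater.
            replies.append(("nack", cid))
        return replies

    def on_relinquish(self, cid):
        """Handle a relinquish message carrying candidate id cid."""
        if cid == self.owner_id and self.potential_id is not None:
            # Case 3a: the held candidate becomes the owner.
            self.owner_id, self.potential_id = self.potential_id, None
            return [("ack", self.owner_id)]
        if cid == self.potential_id:
            self.potential_id = None  # assumption: a dead held candidate is dropped
        return []


v = Node()
print(v.on_capture(1, 1))   # first candidate captures v
print(v.on_capture(1, 2))   # same level, larger id: put on hold
print(v.on_relinquish(1))   # owner relinquishes, held candidate is acknowledged
```

Because a node holds at most one candidate, every arriving message either produces an immediate acknowledgment or displaces the previously held candidate, which matches the note above.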

2.3.2.  Time  and  Message  Complexities 

First we shall prove that the algorithm is deadlock-free. Candidate C_1 can
cause another candidate, C_2, to wait for it only if (1) both try to capture the same
node, v, at the same level, and (2) C_1 captures v while C_2 becomes v's potential
owner (because id_{C_1} < id_{C_2}). Then, C_2 is waiting to get either a positive or a
negative acknowledgment from v. Node v will send a positive acknowledgment to C_2
if it receives a relinquish message from C_1. On the other hand, if v receives a
higher level message, it will send a negative acknowledgment to C_2. Hence, the ids
along any chain of waiting candidates must be increasing and the algorithm is
deadlock-free.

To analyze the message complexity of the algorithm we note the following
fact:

Fact: The sets of nodes captured at level l-1 by candidates which
reach level l are disjoint.

This fact follows from the observation that at most one candidate from all the
candidates which try to capture the same node at level l will proceed to level l+1
(this candidate gets a positive acknowledgment and does not send a relinquish at that
level). Thus, the maximum number of candidates at level l is $\frac{n}{2^l - 1}$. The
total number of messages due to capturing attempts is then the number of candidates at
level l times $2^{l+1} - 1$ times 2, since each message is also acknowledged, i.e.,

$$2 \sum_{l=1}^{\log n} \frac{n}{2^l - 1}\,(2^{l+1} - 1).$$

Similarly, the maximum number of relinquish messages is

$$\sum_{l=1}^{\log n} \frac{n}{2^l - 1}\,(2^{l+1} - 1),$$

which is bounded by 2n log n. Hence, the message complexity of the algorithm is
bounded by 6n log n.


The total delay of the algorithm is composed of two terms: the delay
incurred by capturing messages, and the delay incurred while waiting for relinquish
messages (which is the overhead introduced by the asynchronous model of
communication). While the former contributes O(log n) delay, we will show that the
latter takes O(n).


To prove that the time complexity of the algorithm is O(n) we first give a
scenario which attains this complexity, and then prove that O(n) is also the upper
bound on the worst case time complexity.


In the following scenario n/3 candidates, C_1, ..., C_{n/3}, with ids id_1, ...,
id_{n/3} such that id_i < id_{i+1}, i = 1, ..., n/3 - 1, try to capture the same two
nodes, v and u, two time units after each other. The scenario starts with candidates
C_1 and C_2, each of which has already captured one node and both try to capture
nodes v and u at the same time (see Figure 2.2 and Table 2.2).


[Figure: candidates C_{n/3}, ..., C_4, C_3, C_1 above the two nodes u and v]

Figure 2.2: The O(n) time scenario

The capturing message of C_1 arrives before the capturing message of C_2 at v, and
after it at u. Thus, C_2 captures u and becomes the potential owner of v, and C_1
captures v and becomes the potential owner of u. In the next two time units u sends a
negative acknowledgment to C_1 which then sends a relinquish message to v. At the
same time that C_1 sends the relinquish to v, candidate C_3 sends capturing messages
to v and u. The messages are scheduled such that the message of C_3 arrives at v
just before the relinquish of C_1. Thus, C_3 becomes the owner of v and the potential
owner of u (which is now owned by C_2). In the next two time units v sends a
negative acknowledgment to C_2 which then sends a relinquish message to u. But, at the
same time that C_2 sends a relinquish message to u, candidate C_4 sends capturing
messages to v and u. The scenario proceeds in this pattern for (2/3)·n time units, at
which time all candidates except one are killed. In Table 2.2 the scheduling of the
messages in the scenario is given.


Next we prove that O(n) is also the upper bound on the worst case time
complexity. To this end we claim that every two time units, either the highest level
candidate increments its level by one, or one candidate is effectively eliminated.
Since there are at most n candidates, and the highest level is log n, the worst case
time complexity is at most O(n).

To prove the claim, consider the first time, T_l, and the time interval, Δ_l, in
which l is the highest level in the network, i.e., Δ_l = T_{l+1} - T_l.


Table 2.2: Scheduling of the messages in the scenario

    C_1 captures v
    C_2 captures u
    C_2 becomes v's potential-owner
    C_1 tries to capture u
    u sends a negative acknowledgment to C_1
    C_3 becomes v's potential-owner
    C_1 relinquishes v
    C_3 captures v
    C_3 becomes u's potential-owner
    v sends a negative acknowledgment to C_2
    u sends a negative acknowledgment to C


Clearly, $\sum_{l=0}^{\log n} \Delta_l$ is the time complexity of the algorithm. In the
next lemma we argue that if l persists as the highest level in the network for Δ_l
time, then a number of candidates linearly proportional to Δ_l have been effectively
eliminated during this time. The term "effectively" is used since it might be that the
nodes are notified of their elimination some time after T_{l+1}. The extra period of
time is also linearly proportional to Δ_l and hence the total time complexity is O(n).

Let N_T denote the number of live candidates in the network at time T.
Formally, we claim:

Lemma 2.1: There exist positive constants k_1 and k_2 such that

$$N_{T_{l+1} + k_1 \Delta_l} \le N_{T_l} - k_2 \Delta_l + |U_{\Delta_l}|,$$

where $|U_{\Delta_l}|$ is the number of initiators (candidates) which start the
algorithm in time interval Δ_l.

Proof: Basically we claim that for every two time units in Δ_l (where a time unit is
the maximum delay of a message), at least one candidate at level l is either killed, or
added to a chain of waiting candidates (a similar chain was used in the argument that
the algorithm is deadlock-free). If a long chain is created, then within some time
from T_{l+1} at least half the candidates in the chain are killed. The constant k_1
represents the time it takes the chain to unfold with at least half the candidates
killed. The time it takes the chain to unfold is at most the time it takes a message to
pass along the chain. If k_1 = 1 the time complexity is reduced by at most half, and
thus we will henceforth assume for convenience that k_1 = 1. The constant k_2
represents two parameters: the rate at which the chain of either waiting or dead
candidates is created, and the fraction (at least half) of candidates in the above
chain which get killed. By appropriately scaling the time units we may assume without
loss of generality (w.l.o.g.) that k_2 = 1 as well.


[Figure legend: a solid line from C to v means C owns v; a dashed line from C to v
means C potentially owns v.]

Figure 2.3: The waiting chain of Lemma 2.1

Let us now prove that for every two time units in Δ_l at least one candidate is
either killed, or added to a chain of waiting candidates. If Δ_l = 2 we are done.
Assume Δ_l > 2, and let C_1 be the first candidate to reach level l at time T_l. Then,
at time T_l + 2 there must have been a candidate C_2 such that C_2 has captured a node
which C_1 tries to capture, and either C_2 already caused the death of C_1, or C_1 is
waiting for C_2 to relinquish, or to advance to level l+1 (see Figure 2.3). In the
former case the id of C_1 is smaller than that of C_2 while in the latter it is larger.
The proof thus continues inductively by adding for each two time units in Δ_l another
candidate to a chain of either waiting or killed candidates.

When a chain of waiting candidates unfolds, at least half of the candidates
along the chain are killed. This is because, if candidate C_1 is waiting for C_2, either
C_2 relinquishes, in which case C_1 advances to level l+1 (unless it is waiting for
another candidate), or C_2 advances, in which case C_1 gets a negative acknowledgment
from the node they both tried to capture, and C_1 relinquishes. ■


2.4.  Algorithms  for  Election  in  Asynchronous  Complete  Networks 

Our aim in this section is to derive a linear-time asynchronous algorithm that uses
2n log n + O(n) messages. To this end we present a sequence of three asynchronous
algorithms (A, B, and C), each devised to circumvent the problems of the previous,
so that algorithm C achieves the desired complexity.

The underlying mechanism for all three algorithms is similar. Each algorithm
is initiated by any subset of nodes, each of which is a candidate for leadership.
Each candidate spawns a process which tries to capture all the other nodes by
successfully traversing in both directions all the links incident to its initiator. The
term candidate will be applied interchangeably to both the process and its initiating
node. The candidate which has succeeded in capturing all its neighbors becomes the
leader. To guarantee that only one node is elected, all candidates but one are killed.

All candidates use a variable called level to estimate the number of nodes
they have already captured. The level variable is used by candidates to contest each
other. Captured nodes also have a level variable, which tracks the highest level
candidate they have observed. All level variables are initialized to 0.

A candidate that arrives at a node with a larger level than its own is
eliminated from candidacy. However, if the candidate's level is larger or equal, the
node's level is replaced by the candidate's level. The candidate may then claim the
node and try to eliminate the previous owner of the node. Upon being killed, the
initiating node of a candidate functions like a regular captured node.

The three algorithms differ mainly in two parts: (1) the way that candidates
determine their level, and (2) the rule candidates use to eliminate each other. In
algorithm A, the level of a candidate is the number of nodes it has already captured
(following [Gal77]). In algorithm B, the level is the number of candidates it has
killed. Algorithm A achieves a better time complexity while algorithm B achieves a
better message complexity. In algorithm C, candidates use a combination of the
above two level functions to attain the time complexity of A and the message
complexity of B.

2.4.1.  Algorithm  A 

Level:  In  this  algorithm,  the  level  of  a  candidate  is  the  number  of  nodes  it  has 
already  captured. 

Capturing and Elimination Rule: To capture node v, (1) the (level, id) of a
candidate must be lexicographically larger than the (level, id) of the previous owner
of v, and (2) the previous owner must be killed.

When candidate P arrives at node v which is currently owned by candidate
Q, the following rule is used:

If (Level(P), id(P)) < (Level(v), id(Q)), P is killed.

If (Level(P), id(P)) > (Level(v), id(Q)), (1) v gets P's level, and (2) P is sent to Q.

When P arrives at Q:

If (Level(P), id(P)) < (Level(Q), id(Q)), P is killed.

If Q has been killed already then P captures v.

If (Level(P), id(P)) > (Level(Q), id(Q)), then (1) Q is killed, and (2) P captures v.

Upon returning to its initiating node from a successful capturing, P increases its
level by one.

To keep track of its owning candidate, every captured node has two link
pointers, father and potential-father. The father pointer points to the link through
which the node was most recently captured, and the potential-father pointer points to
the link through which a candidate which tries to claim the node from its father has
arrived.

A candidate, C, that arrives at an already captured node v whose level is
smaller than its own, replaces v's level with its own and becomes v's
potential-father. C is then sent to the father candidate of v. If C survives at v's
father, and meanwhile no other candidate replaced C as the potential-father of v,
then C becomes v's father. If v has not yet been captured, the potential-father
automatically becomes the father of v.
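
The two-step rule of algorithm A (first at the node, then at the node's owner) can be summarized by two small decision functions. This is a sketch of the prose rule only, with hypothetical function names; pairs are (level, id) tuples compared lexicographically, and checking whether Q is already dead before comparing pairs is an assumption about the order in which the stated conditions are tested.

```python
def a_arrive_at_node(p, node_level, owner_id):
    """Candidate P = (level, id) reaches node v owned by Q (algorithm A)."""
    if p < (node_level, owner_id):
        return "P is killed"
    return "v gets P's level; P is sent to Q"


def a_arrive_at_owner(p, q, q_alive):
    """P = (level, id) reaches Q = (level, id), the owner of v (algorithm A)."""
    if not q_alive:
        return "P captures v"
    if p < q:
        return "P is killed"
    return "Q is killed; P captures v"


print(a_arrive_at_node((3, 7), 2, 9))
print(a_arrive_at_owner((3, 7), (3, 9), True))
print(a_arrive_at_owner((3, 7), (5, 9), False))
```

Python tuple comparison is lexicographic, so `(3, 7) < (3, 9)` resolves the level tie by id exactly as the rule requires.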

A  formal  description  of  algorithm  A  is  given  in  Figure  2.4.  As  in  the  syn¬ 
chronous  algorithm  every  initiator  spawns  two  independent  processes,  candidate  and 
ordinary .  The  two  processes  are  connected  by  a  bidirectional  logical  link  which 
behaves  like  a  physical  link. 

Analysis 

The algorithm is deadlock-free since candidates never wait for each other,
and the (level, id) pair is lexicographically increasing along any chain of candidates
which kill each other.

The time complexity of the algorithm is O(n) since candidates never wait for
each other and a candidate which has done more work is never killed by a candidate
which has done less work. Thus, each killed candidate spent, in the worst case, less


level <— owner-id <— 0 ;
untraversed <— E ; father <— nil ;


Candidate (id):

    while ( untraversed ≠ ∅ ) do;
        l <— any( untraversed ) ;
        send(id, level) on l ;
    R:  receive(id', level') over l' ;
        if (id' = id) then /* successful capturing. */
            level <— level + 1 ;
            untraversed <— untraversed - l ;
        else /* another candidate tries to eliminate candidate id */
            if (level', id' < level, id) /* lexicographically */
            then Discard the message, goto R ;
            else /* Candidate id is eliminated */
                (1) send(id', level') over l' ;
                (2) discard all future messages ;
    end while;

    announce(ELECTED, terminate the algorithm), STOP ;

Ordinary:

    for_ever do;
        receive(id', level') over l' ;
        case level', id' of :
            (1) level', id' < level, owner-id:
                Discard message ;
            (2) level', id' > level, owner-id:
                potential-father <— l' ;
                level <— level' ;
                owner-id <— id' ;
                if father = nil then father <— potential-father ;
                send (id', level') over the father link ;
            (3) level', id' = level, owner-id:
                father <— potential-father ;
                send (id', level') over the father link ;
        end case ;
    end for_ever ;

Figure 2.4: Algorithm A


time  than  the  one  killing  it. 


To prove that the communication complexity of the algorithm is O(n log n)
we use a lemma which was introduced in [Gal77].

Lemma 2.2: For any given k, the number of candidates that own n/k or more nodes
is at most k.

Proof: Let C_1 and C_2 be any two candidates which owned n/k nodes at some point
of time. We shall show that each of C_1 and C_2 must have owned at least n/k nodes
disjointly. If they never tried to claim a node from each other, we are done. The
first time that C_1 (w.l.o.g.) tries to claim a node, say v, from C_2, either it causes
the death of one of them, or C_2 has been already killed. If C_1, w.l.o.g., caused the
death of C_2 then clearly it must have owned at least n/k nodes disjoint from C_2, at
the time of killing. If C_2 is already dead, C_1 must still own at least n/k nodes in
order to claim v to itself. ■

Corollary 2.1: The largest candidate to be killed by another candidate owns at most
n/2 nodes, the next largest owns at most n/3 nodes, etc.

Lemma 2.3: The message complexity of algorithm A is 4n ln n
(= 2.773 n log_2 n) messages.

Proof: Since in capturing one node a candidate makes at most 4 hops, a candidate
which owned k nodes incurs at most 4·k messages. By Corollary 2.1, the total cost
is then bounded by $4n \sum_{i=1}^{n} \frac{1}{i}$ messages. Note that each message
of the algorithm contains at most 2 log n bits. ■
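
As a quick numeric check of the constants in Lemma 2.3, the harmonic sum in the proof can be compared with 4n ln n and with the equivalent form 4 ln 2 · n log_2 n. The helper name `lemma_2_3_sum` and the value n = 1024 are chosen only for this illustration.

```python
import math


def lemma_2_3_sum(n):
    """Evaluate 4 * n * sum_{i=1..n} 1/i, the sum in the proof of Lemma 2.3."""
    return 4 * n * sum(1.0 / i for i in range(1, n + 1))


n = 1024
print("harmonic sum bound :", lemma_2_3_sum(n))
print("4 n ln n           :", 4 * n * math.log(n))
print("4 ln 2 (constant)  :", 4 * math.log(2))  # coefficient of n log2 n
```

The harmonic sum exceeds ln n by less than 1, so the sum and 4n ln n differ by at most 4n, a lower-order term.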

The number of candidates at a particular level was constrained by the
disjointness property. Hence, a candidate which captures many nodes from another
candidate tries to eliminate that other candidate as many times as the number of
nodes it captures from it. This gives rise to the factor 4 in the message complexity.
In the next algorithm we remove the disjointness requirement and change the level
function to reduce the message complexity to 2n log n messages.

2.4.2.  Algorithm  B 

Level:  In  this  algorithm,  the  level  of  a  candidate  is  the  total  number  of  other  candi¬ 
dates  that  it  has  killed. 

Capturing and Elimination Rule: To capture node v the level of a candidate must
be strictly larger than that of v, in which case the candidate captures v without
killing the previous owner of v.

When candidate P arrives at node v which is currently owned by candidate
Q, the following rule is used:

If Level(P) < Level(v), P is killed.

If Level(P) > Level(v), v is captured by P, and v gets P's level.

If Level(P) = Level(v), P is sent to Q.

Upon arriving at Q:

If (Level(P), id(P)) < (Level(Q), id(Q)), P is killed.

If Q has already been killed, P is killed too.

If (Level(P), id(P)) > (Level(Q), id(Q)), then (1) Q is killed, (2) P increases its
level by one, and (3) P captures v.
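
In contrast to algorithm A, the node-side test in algorithm B looks at levels only, and a visit to the owner changes P's level. The sketch below returns the outcome together with P's resulting level to make that difference visible; the function names and return format are hypothetical, and the order of the checks at Q is an assumption.

```python
def b_arrive_at_node(p_level, node_level):
    """Candidate P reaches node v (algorithm B); only levels are compared."""
    if p_level < node_level:
        return "P is killed"
    if p_level > node_level:
        return "v is captured by P; v gets P's level"
    return "P is sent to Q"


def b_arrive_at_owner(p, q, q_alive):
    """P = (level, id) reaches owner Q = (level, id); returns (outcome, P's level)."""
    level, _ = p
    if not q_alive or p < q:
        return "P is killed", level
    return "Q is killed; P captures v", level + 1  # the kill raises P's level by one


print(b_arrive_at_node(2, 2))
print(b_arrive_at_owner((2, 8), (2, 5), True))
```

A candidate therefore pays for the extra hop to Q only on a level tie, and every successful hop of that kind removes one live candidate.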

Details 

A formal description of algorithm B is given in Figure 2.5. When a candidate
arrives at node v whose level is the same as its own, and the id of v's father, Q,
is smaller, it becomes v's potential-father. The potential-father is then sent to Q in
an attempt to kill it. If another candidate at the same level with even higher id
arrives at v before the potential-father returns from Q, then this other candidate is
killed. If the potential-father survives at Q it first increments its level by one, then
returns to v and captures it, and only then returns to its initiating node. However, if
the potential-father finds that Q is already killed, it eliminates itself as well (since if
Q was killed, there exists a higher level candidate in the network).

Analysis 

Since at most half of the candidates at level k go up to level k+1, the
maximum level achievable during the algorithm is log n. Clearly, every time a node is
recaptured its level is increased by at least one. Hence, the total number of captures
possible is at most n log n. Each capture uses 2 messages, which sums up
to a total of 2·n·log n messages. The number of extra messages spent by candidates
which go over father links to other candidates is at most 2n, since each such traversal
results in the elimination of one live candidate. Thus, the message complexity of the
algorithm is 2·n·log n + 2n messages, each of length log n + log log n bits.


The time complexity of the algorithm is Ω(n log n), by the following
scenario, in which $n/2$ of the nodes are captured serially $\log(n/2)$ times. The
algorithm is started by node $v_0$, which captures $n/2$ nodes in level 0. Then a new
node, $v_1$, spontaneously starts the algorithm, kills $v_0$, increases its level to 1, and
recaptures the same $n/2$ nodes. After $v_1$ has captured the $n/2$ nodes, two new nodes
spontaneously start the algorithm and try to kill each other, and the one which
survives, $v_2$, reaches level 1. Node $v_2$ then kills $v_1$ and recaptures the nodes at
level 2. The scenario continues until the entire network has been captured by
$v_{\log(n/2)}$, which is
/* The variable size is for algorithm C only */

Initially:
    level ← size ← 0 ; owner-id ← potential-id ← 0 ;
    untraversed ← E ; father ← potential-father ← nil

Candidate(id):
    while (untraversed ≠ ∅) do;
        e ← any(untraversed) ;
        send(level, id) on e ;
    R:  receive(level', id') over e' ;
        if (id' = id) then /* successful capturing */
            level ← level' ;
            untraversed ← untraversed - e ;
        else-if ((level', id') < (level, id)) /* lexicographically */
            then Discard the message, goto R ;
        else (1) send(level', id') over e ;
             (2) Discard all future messages ;
    end while;
    announce(ELECTED, terminate the algorithm), STOP ;

Ordinary:
    level ← -1 ;
    while (not terminated) do;
        receive(level', id') over e' ;
        case level' of:
        (1) level' < level : Discard message ;
        (2) level' > level : /* Replace the father */
            father ← e' ; level ← level' ; owner-id ← id' ;
            potential-id ← 0 ; potential-father ← nil ;
            send(level', id') over the father link ;
        (3) level' = level :
            if (id' < owner-id) then Discard message ;
            else-if (id' = potential-id) then
                father ← potential-father ;
                level' ← level' + 1 ;
                owner-id ← id' ;
                potential-id ← 0 ;
                potential-father ← nil ;
                send(level', id') over the father link ;
            else-if there is already a potential-father
                then Discard message ;
            else /* there is no potential-father */
                potential-id ← id' ;
                potential-father ← e' ;
                send(level', id') over the father link ;
        end case ;
    end while ;

Figure 2.5: Algorithm B
elected as a leader.


The time complexity increases because, unlike in algorithm A, the level of a
candidate here is not a function of the number of nodes it has already captured.
A candidate which spent a lot of work (and time) accumulating nodes might be
killed by a candidate which did not spend nearly as much. Although algorithm A
does not suffer from this problem, it has the problem that candidates could be
"killed" many times. In the next algorithm we eliminate both problems by
employing both techniques simultaneously in one algorithm.

2.4.3.  Algorithm  C 

Here we make two modifications to algorithm B in order to achieve a linear
time complexity with no increase in the communication cost. First, we incorporate
an estimate of the amount of work spent by each candidate into the level function of
algorithm B. Second, we enable candidates with a high level (> log n) to capture
many nodes in parallel (in one time unit). We start describing the algorithm with the
first modification. The second modification will be introduced during the
performance analysis.

Level: In this algorithm the level of a candidate is increased according to two rules.
First, the same rule as in algorithm B is used; second, after each capture the
candidate increases its level to be at least log(total number of nodes captured), i.e.,
after returning from a successful capture the level is set to
MAX(log(# nodes captured), present level).

Capturing  and  Elimination  Rule:  Same  as  in  algorithm  B. 

Details 
The formal description of the algorithm is similar to that of algorithm B. A
variable, called size, is used to count the number of nodes captured by a candidate.
The only change is to replace the first "then" clause, within the "while" loop of a
candidate program, with:

    then
        size ← size + 1 ;
        level ← max(level', log size) ;
        untraversed ← untraversed - e ;
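
The modified "then" clause can be sketched as a small Python function. This is an illustration only: the function name `update_after_capture` is hypothetical, and the logarithm is assumed to be base 2 and rounded down, which the text does not state explicitly.

```python
def update_after_capture(size: int, level_returned: int) -> tuple:
    """Return the new (size, level) after one successful capture.

    level_returned plays the role of level' in the figure: the level carried
    by the returning capture message.
    """
    size += 1
    log_size = size.bit_length() - 1        # floor(log2(size)) for size >= 1
    level = max(level_returned, log_size)   # MAX(log(# nodes captured), present level)
    return size, level
```

With this rule a candidate that only captures (and never kills) reaches level i after capturing $2^i$ nodes, while a level gained by killing another candidate is never lowered by the size term.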


Analysis 


To  analyze  its  performances  we  will  first  show  that: 

Lemma  2.4:  The  maximum  level  reachable  during  any  execution  of  algorithm  C  is 
log  n  +  log  log  n  +  1. 


Proof: Let $N_i$ be the total number of candidates that reach level i during the
execution of the algorithm. Consider the maximum number of candidates which could
possibly pass from level i-1 to level i. There are two ways in which a candidate
can go from level i-1 to level i. First, by capturing $2^{i-1}$ nodes at level i-1, for
i < log n, and second, by killing another candidate which is at level i-1. We note that
$N_i$ is maximized if as many candidates as possible pass from level i-1 to level i by
capturing other nodes (i.e., $n/2^{i-1}$ candidates) and the rest of the candidates (i.e.,
$N_{i-1} - n/2^{i-1}$) kill each other in pairs. Hence,

$$N_i \le \frac{n}{2^{i-1}} + \frac{1}{2}\left(N_{i-1} - \frac{n}{2^{i-1}}\right). \qquad (1)$$

Solving (1) for $N_i$ we get:


Substituting $N_i = 1$ in (2) and solving for i gives us the maximum level, which is
log n + log log n + 1. ■
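
The counting argument in the proof can be checked numerically. The sketch below assumes, as one reading of the proof's prose, the recurrence $N_i \le n/2^{i-1} + \frac{1}{2}(N_{i-1} - n/2^{i-1})$ with the base case $N_0 = n$ (every node may start as a candidate). Both the base case and the function names `level_bounds` and `max_level` are assumptions made for illustration; a different base case shifts the closed form only by a lower-order term.

```python
import math
from fractions import Fraction


def level_bounds(n: int, top: int) -> list:
    """Return [N_0, ..., N_top] from the assumed recurrence, as exact fractions."""
    bounds = [Fraction(n)]
    for i in range(1, top + 1):
        quota = Fraction(n, 2 ** (i - 1))                 # candidates advancing by capturing
        bounds.append(quota + (bounds[-1] - quota) / 2)   # the rest kill each other in pairs
    return bounds


def max_level(n: int) -> int:
    """Largest level i whose bound N_i still allows at least one candidate."""
    i = 0
    while level_bounds(n, i + 1)[-1] >= 1:
        i += 1
    return i


if __name__ == "__main__":
    for k in (4, 8, 16):
        n = 2 ** k
        print(n, max_level(n), k + math.log2(k) + 1)
```

Under these assumptions the recurrence solves to $N_i = (i+1)\,n/2^i$, and the largest level with $N_i \ge 1$ stays within log n + log log n + 1 for the network sizes tried.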

Using the same argument as in algorithm B we find that the message
complexity of algorithm C is 2n(log n + log log n + 2) messages, each of length
log n + log(log n + log log n) bits.

With the above modification it can be shown that the time complexity of
algorithm B is reduced to O(n log log n). In order to further reduce the time
complexity to O(n), processes at levels higher than log n will try to capture n/log n
nodes in parallel. Thus a candidate which has reached level log n will send
messages over n/log n untraversed links incident to it. Each of these messages carries
the (level, id) of the candidate. When a message arrives at an adjacent node, the node
compares its level to that of the message. If the message level is higher, the node
replaces its (level, id) with that of the message, thus making the candidate the new
father of the node. The node then sends the candidate an acknowledgment of
successful capture. If the message level is smaller, it returns no message. Finally, if
the message level is the same as that of the node but the message id is higher, a
notification to that effect is sent back to the candidate.


The candidate waits for all the n/log n acknowledgments. If all the
acknowledgments indicate a successful capture, the candidate proceeds to the next
n/log n untraversed incident links. If, on the other hand, some of the
acknowledgments indicate that they have encountered the same level, one of the links
is arbitrarily chosen and a process that behaves as in algorithm B is sent along that
link. If the process returns, the candidate increases its level and proceeds to the next
n/log n untraversed links (links on which no successful capture was reported are not
considered traversed).

To analyze the algorithm with this modification we make two
observations: First, the maximum attainable level in the algorithm is still bounded
by log n + log log n + 1. Second, by substituting i = log n in equation (2), we find that
the maximum number of candidates which reach level log n is log n.

The last modification has increased the communication complexity of the
algorithm by at most O(n) messages. Each node is still captured at most
log n + log log n + 1 times; however, the death of a candidate at a level greater than
log n might be associated with at most 2n/log n messages. Since there are at most
log n such candidates, the increase due to killings is bounded by O(n).

To show that the time complexity of the algorithm is O(n) we arrange the
candidates in a rooted tree. Each level of the tree corresponds to the candidates that
have reached that level in the algorithm, i.e., the nodes at level i in the tree
correspond to the candidates that have reached level i in the algorithm. The parent
of a candidate at level i in the tree is either the candidate that caused the death of the
given candidate or the same candidate at the next level. The time delay of the
algorithm is the sum of the delays incurred by candidates along the path from the first
initiator (candidate) to wake up, at level 0, to the root.
To evaluate this time delay we note that no candidate that either survives or
is killed at level i spends more than $2^{i-1}$ time units in level i, for i < log n. In
level i, log n ≤ i ≤ log n + log log n, no candidate spends more than log n time units,
since it captures nodes at a rate of n/log n per time unit. Hence, the total time delay
of the algorithm is bounded by

$$\sum_{i=1}^{\log n} 2^{i-1} + \log n \log\log n \;\le\; n + \log n \log\log n.$$

Note that we scale a time unit to be the maximum delay it takes to capture one node,
which is a constant.
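
As a quick arithmetic check of the bound above, the following sketch adds up the per-level delays for a network of $n = 2^k$ nodes. It is an illustration under assumptions: the function name `time_bound` is hypothetical, n is taken to be a power of two, logarithms are base 2, and log log n is rounded up to a whole number of levels.

```python
import math


def time_bound(n: int) -> int:
    """Sum of the per-level delay bounds for n = 2**k nodes (k >= 1)."""
    k = n.bit_length() - 1                              # log n
    loglog = math.ceil(math.log2(k))                    # number of high levels, rounded up
    low = sum(2 ** (i - 1) for i in range(1, k + 1))    # serial capture: 2^(i-1) per level
    high = k * loglog                                   # at most log n time units per high level
    return low + high


if __name__ == "__main__":
    for k in (4, 10, 20):
        print(2 ** k, time_bound(2 ** k))
```

For the sizes tried, the total stays below 2n, in line with the O(n) claim; the low levels contribute n - 1 and the high levels only a log n log log n term.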

In  the  above  calculation  we  did  not  include  the  actual  time  it  takes  candidates 
to  kill  each  other.  Since  there  are  at  most  n  candidates  and  no  candidate  tries  to  kill 
a  dead  one  (unlike  algorithm  A),  this  delay  is  also  bounded  by  O  ( n ). 

2.5.  Conclusions 

An O(n log n) messages, O(log n) time synchronous election algorithm and an
O(n log n) messages, O(n) time asynchronous election algorithm were presented. It
remains open whether an O(n log n) messages, sublinear time asynchronous election
algorithm exists. We conjecture that such an algorithm does not exist and hence that
the synchronous mode of communication is more powerful than the asynchronous mode.

Three asynchronous election algorithms (A, B, and C) were presented. The
simplicity of the complete network topology, and, hence, of termination detection,
enabled us to concentrate on the synchronization among contending candidates.
With each of the three algorithms we can associate an analogous algorithm for
arbitrary topology networks, which uses the corresponding method to synchronize
different initiations of the algorithm but a different method to traverse the network
(i.e., to detect termination). The analogy to algorithm A is given in [Gal77]. In [Gal83]
the same level function as in algorithm B was used; however, there candidates merge
their "territories" (rather than kill each other) when they meet. In [Gaf85] the time
complexity of [Gal83] is improved by replacing its level function with that of
algorithm C. Each of the methods can be applied to other classes of networks. For
example, applications of method B in different classes of topologies are presented in
[Kor85]. As in algorithm B, the time complexity of the algorithms in [Kor85] is
O(n log n), whereas similar applications, but of methods A or C, improve the time
complexity of [Kor85] while maintaining the message optimality.


CHAPTER  3. 


LOWER  BOUNDS  FOR  ELECTION  IN  COMPLETE  NETWORKS 
3.1.  Introduction 

Two algorithms for election in synchronous complete networks were
discussed in the previous chapter. The first is an $O(n^2)$ messages, O(1) time
algorithm and the second is an O(n log n) messages, O(log n) time algorithm. The two
algorithms raise two questions: (1) Is Ω(n log n) also the lower bound on the message
complexity of election in synchronous complete networks, and (2) if Ω(n log n) is
the message complexity lower bound, then how fast can a message-optimal
algorithm be, i.e., is there an O(n log n) messages, O(1) time algorithm, or is Ω(log n)
the lower bound on the time complexity of any message-optimal synchronous
algorithm?

In this chapter we answer these questions by proving first that Ω(n log n) is a
lower bound on the worst case message complexity of a synchronous algorithm, and
second that Ω(log n) is a lower bound on the time complexity of any
message-optimal synchronous algorithm. In proving these bounds we do not restrict
the type of operations performed by the nodes. The bounds thus apply to general
algorithms and not just to comparison based algorithms. This proves that the
synchronous algorithm of Chapter 2 is optimal.

Furthermore, in Chapter 2 we have presented a continuum of
$O\!\left(\frac{c}{\log c}\, n \log n\right)$ messages, $2\log_c n$ time, synchronous
algorithms, where c = 2, ..., n. In Section
3.2.2 each algorithm in the continuum is shown to be optimal by proving that if an
algorithm (whether comparison or general) elects a leader in at most
$\frac{1}{2}\log_c n$ rounds, then its message complexity is at least
$\frac{c-1}{2\log c}\, n \log n$.

3.2.  Lower  Bounds 

To show the lower bounds, a scenario in which any synchronous (and hence
also asynchronous) algorithm transmits at least $\frac{n}{2}\log n$ messages is
constructed using an adversary argument. A similar argument is then used to show that
the delay of any message-optimal algorithm is Ω(log n) rounds.

3.2.1.  Definitions  and  Assumptions 

Consider an arbitrary election algorithm on the synchronous model defined
above. An event is the sending of a message over a previously unused link (two
messages sent in the same round in opposite directions over a previously unused link
are considered two separate events). With each event we associate a pair (s, d),
where s is the source node and d is the destination node of the corresponding
message. With each round i of the algorithm we associate a set of events, $R_i$. A
sequence $E = (R_0, R_1, \ldots)$ is called an execution. An execution-prefix $E_j$ is a
prefix, $(R_0, R_1, \ldots, R_j)$, of an execution E. With each run of the algorithm we
associate an execution, called a legal-execution, that includes all events which
occurred in the run, arranged in order of the corresponding rounds. Henceforth, any
mention of a message refers to an event.

A cluster in an execution-prefix $E_j$ is a maximal subset of nodes spanned by
a connected subnetwork of links which were used by events which occurred in $E_j$.
The degree of a node v in an execution-prefix $E_j$ is the number of links incident to v
which were used by events in $E_j$. The potential-degree of node v in an
execution-prefix $E_j$ is the degree of v in $E_j$ plus the number of times that v is a
source node of an event in $R_{j+1}$. The potential-degree of a set of nodes is the
maximum potential-degree among its nodes.
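
The definitions of cluster, degree, and potential-degree can be made concrete with a short Python sketch. The representation is an assumption made for illustration: nodes are numbered 0..n-1, an execution-prefix is a list of rounds, each round is a list of (s, d) events, and the function names are hypothetical.

```python
def clusters(n, prefix):
    """Maximal sets of nodes connected by links used in the execution-prefix."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for rnd in prefix:
        for s, d in rnd:
            parent[find(s)] = find(d)       # a used link merges two clusters
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


def degree(v, prefix):
    """Number of links incident to v that were used by events in the prefix."""
    used_links = {frozenset(event) for rnd in prefix for event in rnd}
    return sum(1 for link in used_links if v in link)


def potential_degree(v, prefix, next_round):
    """Degree of v plus the number of events in the next round with source v."""
    return degree(v, prefix) + sum(1 for s, _ in next_round if s == v)
```

A link used in both directions in the same round counts as two events but only one link, which is why `degree` collects undirected links in a set before counting.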

For the purpose of proving the lower bounds we introduce a slightly different
model, called the stopping-model. The stopping-model allows us to withhold the
clock pulse at the beginning of round j from a cluster C in $E_{j-1}$, given that no node
in C is expected to receive a message in round j from a node not in C. The nodes in
C are then said to be frozen in round j. Therefore, a frozen node in a round neither
sends nor receives any message in that round; nor does it change its state. The
stopping-model will be used to prevent large differences in the clusters' growth
rates.

A stopping-execution is an execution which corresponds to a run of the
algorithm in the stopping-model. A stopping-execution is called a
k-stopping-execution if the cumulative number of pulses withheld over all clusters
throughout the run is k. Obviously, a 0-stopping-execution is a legal-execution.

Lemma 3.1: For any k-stopping-execution E, there exists a (k-1)-stopping-execution
E' which contains exactly the same events as E does.

Proof: Let l be the minimum index of a round in which any cluster is frozen, and
let C be a cluster which is frozen in l. An execution E' which satisfies the
lemma can be obtained from E by shifting all events which occurred before round l
and involve nodes in C one round forward. This affects neither any event in later
rounds nor any event which involves nodes not in C, because neither in E nor in
E' is there an event connecting a node in C with a node not in C in any round
$R_j$, j < l, and because $R_{l+1}$ in E is identical to $R_{l+1}$ in E'. Notice that in E' the
nodes in C are awakened one round later than in E. ■


Corollary  3.1:  For  any  stopping-execution  there  exists  a  legal-execution  which  con¬ 
tains  the  same  events. 


In  the  next  two  sections  we  will  prove  the  lower  bounds  on  the  stopping 
model.  Using  Corollary  3.1,  these  bounds  apply  also  to  the  non  stopping  model.  In 
our  proofs  we  do  not  restrict  the  type  of  operations  performed  by  the  nodes,  hence 
proving  the  bounds  for  general  algorithms. 


3.2.2.  A  Lower  Bound  on  Message  Complexity 


At the end of any election algorithm all nodes know who the leader is, hence
any such algorithm has to send messages along the links of a spanning subnetwork.
In other words, by the end of the algorithm the whole network is contained in one
cluster. Thus, no cluster in the algorithm can defer indefinitely the sending of
messages to nodes not in the cluster, as the rest of the network might not wake up
spontaneously.


In the following proof of the lower bound we will use an adversary argument
to construct a stopping-execution which contains at least $\frac{n}{2}\log n$ events. In the
beginning of each round, the adversary first determines which clusters to freeze and
then determines the destination of messages sent in this round over previously
unused links. The first feature is used to delay the formation of larger clusters until
later rounds in the run, thus avoiding large differences in the clusters' growth rates;
the second feature is used to send as many messages as possible within one cluster.
The second feature is possible since links incident to a given node on which no
message was sent or received are indistinguishable to this node.

Theorem 3.1: A stopping-execution of an election algorithm in a synchronous
complete network of n nodes contains at least $\frac{n}{2}\log n$ events, in the worst case.

Corollary 3.2: The message complexity of any election algorithm in a synchronous
complete network of n nodes is at least $\frac{n}{2}\log n$.

Proof of Theorem 3.1: Assume w.l.o.g. that $n = 2^q$. We define a sequence of
partitions $(P_0, \ldots, P_q)$ of the nodes such that each subset in partition $P_0$ contains
one node, and each subset in $P_j$ contains two subsets from $P_{j-1}$, $1 \le j \le q$. Hence,
each subset in $P_j$ contains $2^j$ nodes.

We construct, in q phases, a sequence of stopping-execution-prefixes
$(E_{i_0}, \ldots, E_{i_q})$, $i_0 = 0$, each being a prefix of the next. $E_{i_0}$ is an empty
execution-prefix in which all nodes have been awakened and the potential-degree of
each node is at least 1. This is done by withholding the clock pulse from any node
whose potential-degree is at least 1 until there is no node with potential-degree 0.
Inductively we assume that: (1) any cluster in $E_{i_j}$ is contained within one subset in
$P_j$, and (2) the potential-degree, in $E_{i_j}$, of every subset in $P_j$ is at least $2^j$.
Obviously, $E_{i_0}$ satisfies these assumptions.

Assuming that $E_{i_{j-1}}$ has been constructed, we describe how the adversary
constructs $E_{i_j}$, $j = 1, \ldots, q-1$. In each round of phase j, we freeze all the subsets
in $P_j$ whose potential-degree is at least $2^j$. When all subsets are frozen, phase j is
complete. The source and destination nodes of any message sent in this phase are
both in the same subset in $P_j$. This is always possible since every node that has a
potential-degree of at least $2^j$ is frozen. Clearly, $E_{i_j}$ satisfies the inductive
assumptions. In the q-th phase no freezing takes place. After that phase, the network
is contained in one cluster and the algorithm is assumed to produce no more events.

Clearly, there are at least $n/2^j$ nodes whose degree at the end of the
algorithm is at least $2^j$, for $j = 0, \ldots, q-1$. Thus, the total number of events is at
least $\frac{n}{2}\log n$. ■

Given that the message complexity of any election algorithm on a synchronous
complete network is Ω(n log n), the question arises how fast a message-optimal
algorithm can be. In the next section we prove that the time complexity of any
message-optimal algorithm is Ω(log n).

3.2.3.  A  Lower  Bound  on  Time  Complexity 

In  this  section  we  will  extend  the  techniques  of  the  previous  section  to  prove 
that  the  shorter  the  length  of  the  execution  the  larger  the  lower  bound  on  the  number 
of  events  it  must  contain. 

Theorem 3.2: Any stopping-execution of an election algorithm in a synchronous
complete network of n nodes which terminates in less than $\frac{1}{2}\log_c n$ rounds
contains at least $\frac{c-1}{2\log c}\, n \log n$ events.

Corollary 3.3: The time complexity of any message-optimal election algorithm in a
synchronous complete network of n nodes is Ω(log n) rounds.


Proof of Theorem 3.2: Consider an election algorithm whose time complexity is at
most $\frac{1}{2}\log_c n$. Assume w.l.o.g. that $n = c^q$. A construction similar to the proof
of Theorem 3.1 will be used here. We construct, in q phases, a sequence of
stopping-execution-prefixes $(E_{i_0}, \ldots, E_{i_q})$, $i_0 = 0$, each being a prefix of the
next, and a sequence of partitions $(P_0, \ldots, P_q)$, each subset of a partition
containing c subsets of the previous one. Each subset of $P_0$ contains one node, thus
each subset of $P_j$ contains $c^j$ nodes. $E_{i_0}$ is an empty execution-prefix in which all
nodes are awakened spontaneously. Inductively we assume that: (1) any cluster in
$E_{i_j}$ is contained within one subset of $P_j$, and (2) the potential-degree in $E_{i_j}$ of
every subset in $P_j$ is at least $c^j$. Obviously, $E_{i_0}$ and $P_0$ satisfy these assumptions.

Assuming that $E_{i_{j-1}}$ has been constructed, the adversary constructs $E_{i_j}$ by
first defining the subsets of $P_j$ and then constructing $E_{i_j}$. Let $(S_1, \ldots, S_k)$,
$k = n/c^{j-1}$, be the subsets of $P_{j-1}$ indexed in nondecreasing order of their
potential-degrees in $E_{i_{j-1}}$. Then the i-th subset of $P_j$ is defined as the union of
$S_{(i-1)c+1}, \ldots, S_{ic}$, $i = 1, \ldots, n/c^j$. This implies that if subset S in $P_j$
contains one subset from $P_{j-1}$ whose potential-degree is at least $c^j$, then all subsets
from $P_{j-1}$ in S have potential-degree at least $c^j$, with the exception of at most one
subset of $P_j$, called the boundary subset.
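
The way the adversary builds the next partition can be sketched in Python. This is an illustration under assumptions: subsets are represented as frozensets of node numbers, `pot_degree` is a hypothetical callable returning the potential-degree of a subset, and the function name `next_partition` is not from the text.

```python
def next_partition(prev_subsets, pot_degree, c):
    """Group the subsets of P_{j-1}, in nondecreasing potential-degree order, c at a time."""
    ordered = sorted(prev_subsets, key=pot_degree)   # S_1, ..., S_k
    return [
        frozenset().union(*ordered[i:i + c])         # S_{(i-1)c+1} U ... U S_{ic}
        for i in range(0, len(ordered), c)
    ]


if __name__ == "__main__":
    deg = {0: 5, 1: 1, 2: 3, 3: 1, 4: 4, 5: 2, 6: 9, 7: 1, 8: 3}
    p0 = [frozenset([v]) for v in range(9)]
    p1 = next_partition(p0, lambda s: max(deg[v] for v in s), 3)
    print([sorted(s) for s in p1])
```

Because the subsets are grouped in sorted order, subsets with high potential-degree end up together, so at most one new subset mixes subsets below and above a given threshold; that subset plays the role of the boundary subset.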

In each round of phase j, $j = 1, \ldots, q-1$, we freeze all the subsets in $P_j$
whose potential-degree is at least $c^j$. When all subsets are frozen, phase j is
complete. The destinations for messages to be sent by node v are selected from the
subsets which included v in partitions $P_0, \ldots, P_j$, in that order of priority. This is
always possible since every node that has a potential-degree of at least $c^j$ is frozen.
Clearly, $E_{i_j}$ and $P_j$ satisfy the inductive assumptions. After the q-th phase, the
network is contained in one cluster and the algorithm is assumed to produce no more
events.


We now show that every node is the destination of at least
$\frac{1}{2}(c-1)\log_c n$ events in $E_{i_q}$. As the time complexity is at most $\frac{q}{2}$,
every node in $E_{i_q}$ must have been frozen in all the rounds of at least $\frac{q}{2}$
phases. Otherwise, the legal-execution corresponding to $E_{i_q}$ would contain more
than $\frac{q}{2}$ rounds, contradicting the assumption on the time complexity. If node v
is frozen in all the rounds of phase j, it will later receive one message from every
subset in $P_{j-1}$ which does not contain v and is with v in a subset of $P_j$ (unless v is
in a boundary subset). Thus, for each phase that v is frozen in all its rounds, v is the
destination of c-1 events. The total number of events in $E_{i_q}$ is thus at least
$\frac{c-1}{2\log c}\, n \log n - nc$. The term $nc$ is due to the nodes in the boundary
subsets (since, due to the boundary subset in phase j, at most $c^{j-1}(c-1)$ events
should be discounted). ■


3.3.  Conclusions 


The effect of synchronous and asynchronous communication on the problem
of distributively electing a leader in a complete network was examined in the last
two chapters. On the one hand, it was proved that the message complexity is not
affected by the choice of the communication mode. In both modes of
communication, the message complexity was shown to be Θ(n log n). On the other
hand, it remains open whether or not the choice of communication mode affects the
time complexity of a message-optimal algorithm. With synchronous communication,
the time complexity of message-optimal algorithms was proved to be Θ(log n),
whereas with asynchronous communication, only an O(n) upper bound on the time
complexity was obtained. The lower bound on time for asynchronous communication
remains an open question and is the subject of the following conjecture:

Conjecture: The time complexity of any message-optimal asynchronous
election algorithm on a complete network is Ω(n).

The implication of the conjecture is that synchronous communication is
faster by a factor of n/log n than asynchronous communication. An analogous result
was obtained in [Aij83], where a particular synchronous system of parallel
processors was proved to be faster by a factor of log n than the corresponding
asynchronous system.


CHAPTER  4. 


TRAVERSAL  OF  UNIDIRECTIONAL  NETWORKS 
4.1.  Introduction 

In  this  chapter  we  address  the  problem  of  traversing  a  unidirectional  network. 
In  the  traversal  problem,  one  node,  called  the  root ,  initiates  a  single  process  (token) 
which  has  to  visit  all  the  nodes  in  the  network,  one  at  a  time.  If  necessary  the  pro¬ 
cess  may  go  over  any  link  more  than  one  time. 

A traversal algorithm is an identical program residing at each node in the
network. When the process arrives at a node, the program decides which of its
incident links will be the next in the traversal. The algorithm also detects when the
process has visited all the nodes. To this end, the program in each node uses local
variables to mark the node and its incident links. The marks are used by the program
upon the next arrival of the process. In addition, the nodes of the network may use
the process to carry messages between them.

We consider two types of unidirectional networks. The two networks differ
in the amount of memory available for the algorithm at each node. In the first model
$O(\log n + d_v)$ bits of memory are available for the algorithm in node v, where n is
the total number of nodes in the network and $d_v$ is the degree of node v. In the
second model only $O(d_v)$ bits of memory are available.


Clearly,  to  be  able  to  traverse  a  bidirectional  network,  the  network  has  to  be 
connected.  Similarly,  to  traverse  a  unidirectional  network,  the  network  has  to  be 
strongly  connected,  i.e.,  there  should  be  a  directed  path  from  every  node  to  every 
other  node. 
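
The strong-connectivity condition can be tested with two reachability searches from any one node, one along the links and one against them. The sketch below is a minimal illustration; the function names and the representation of links as (source, destination) pairs are assumptions.

```python
from collections import deque


def reachable(adj, start):
    """Set of nodes reachable from start by following adj (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def is_strongly_connected(nodes, links):
    """True if there is a directed path from every node to every other node."""
    nodes = set(nodes)
    if not nodes:
        return True
    forward, backward = {}, {}
    for u, w in links:
        forward.setdefault(u, []).append(w)
        backward.setdefault(w, []).append(u)
    root = next(iter(nodes))
    # Every node must be reachable from root, and root from every node.
    return reachable(forward, root) >= nodes and reachable(backward, root) >= nodes
```

A directed ring passes the test, while a directed path does not, since its last node cannot reach the first.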

Ω(|E|) is obviously a lower bound on the number of messages transmitted
by any traversal algorithm. This is because every link in the network has to be
traversed, or otherwise an untraversed subnetwork could reside on the other side of
any untraversed link.

The problem of distributively traversing a bidirectional network is essentially
the same as the centralized problem of searching a graph under the restriction that
any two consecutively visited nodes are neighbors in the network. One search
algorithm which satisfies this restriction is the Depth First Search (DFS) algorithm
[Tar72, Hop73] and thus it can be the basis for a distributed bidirectional network
traversal. In the resulting traversal the process makes 2|E| hops, and the number
of memory bits it uses at each node is linear in the degree of the node.

In the bidirectional DFS algorithm, the root spawns a process which visits all
the nodes in the network. Upon arriving at node v for the first time, say through link
l, the process marks v and sequentially traverses each of v's incident links, except l.
If the process arrives at an already marked node, it backtracks to the node from
which it came. After backtracking from all of v's incident links, except l, the
process backtracks from v over link l. The traversal is completed when all the links
incident to the root have been backtracked.
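The description above can be illustrated by a small centralized simulation. The following is only a sketch under stated assumptions: the network is given as an adjacency dictionary of a connected, simple, undirected graph, and the names `dfs_token_traversal`, `adj`, `used` and `hops` are hypothetical and do not appear in the original text. The sketch counts one hop for every transmission of the process over a link.

```python
def dfs_token_traversal(adj, root):
    """Simulate the token-based bidirectional DFS and count its hops.

    adj maps every node to the list of its neighbours (assumed: connected,
    simple, undirected graph).  Returns (hops, set of marked nodes).
    """
    marked = {root}
    used = set()      # links already traversed, stored as frozensets
    hops = 0
    path = [root]     # nodes from the root to the current location
    while path:
        v = path[-1]
        # next incident link of v that has not been traversed yet
        nxt = next((u for u in adj[v] if frozenset((v, u)) not in used), None)
        if nxt is None:
            path.pop()
            if path:
                hops += 1          # backtrack from v over its entry link
            continue
        used.add(frozenset((v, nxt)))
        hops += 1                  # forward hop over the link
        if nxt in marked:
            hops += 1              # already marked: the process is sent back
        else:
            marked.add(nxt)
            path.append(nxt)
    return hops, marked


# A triangle 0-1-2 with a pendant node 3 attached to node 2 (|E| = 4).
example = {0: [1, 2], 1: [0, 2], 2: [1, 0, 3], 3: [2]}
hops, marked = dfs_token_traversal(example, 0)
print(hops, sorted(marked))
```

In this example every link is crossed once in each direction, which agrees with the 2·|E| hop count stated above.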

In solving the unidirectional traversal problem we would like to adapt the
bidirectional DFS traversal. However, it is not obvious how to backtrack in a
unidirectional network. In spite of this difficulty, the final unidirectional traversal
algorithm presented in this chapter is based on the DFS algorithm. The difficulty of
backtracking is surmounted by constructing, on the fly, a structure called an
in-directed forest. An in-directed forest is a subnetwork in which there is a unique
path from any fully-backtracked node to exactly one visited node which is not yet
fully-backtracked. In constructing the forest we will use a technique similar to the
one used in the strongly connected components algorithm of Hopcroft and Tarjan
[Hop73].

In this chapter, three traversal algorithms are presented. Traversal-1 is
simple but inefficient. In many networks the process of Traversal-1 hops over an
exponential number of links before terminating. Traversal-2, which is based on the
DFS algorithm, makes at most O(n·|E|) hops on any network. Furthermore, we
show that Ω(n·|E|) is a lower bound on the number of hops, in general.

In both Traversal-1 and Traversal-2, O(log n + d_v) bits of memory are
required at each node v, and that much is also carried along with the traversing
process (i.e., the message size is O(log n + d_max) bits, where d_max is the maximum
degree in the network).

In some applications, such as VLSI, memory size and message length are
restricted, and the question then arises: could a unidirectional traversal be
implemented using only a constant number of bits in every node and with the traversing
process (i.e., in a unidirectional network of finite automata)? We answer the question
positively by presenting Traversal-3, a traversal algorithm for unidirectional
networks of finite automata. Traversal-3 makes at most O(n·|E| + n² log n) hops,
which is optimal in the worst case (dense networks, in which |E| = Ω(n log n)).


Throughout the discussion we make a distinction between unidirectional and
directed networks. A unidirectional network, as defined before, is a network in
which some or all of the links are unidirectional links. A directed network is a
bidirectional network in which a unique direction is associated with each link. The
link directions are given as part of the problem definition.

4.2.  Traversal-1: a simple traversal algorithm

The algorithm is composed of two mechanisms: a termination detection
mechanism, and a routing mechanism. The termination detection mechanism
enables the traversing process to detect that it has traversed all the links in the
network. The routing mechanism is used at each node to select the next link on which
to send the process such that in a finite number of hops the process will detect
termination.

The termination detection mechanism is implemented by a counter, called the
debt-counter, which is carried by the process. The counter is incremented by one
whenever the process arrives at a node for the first time. It is decremented by one
just before leaving a node through its last untraversed outgoing link. After leaving a
node at least once through each of its outgoing links, the debt-counter is never
changed again at this node. To start the traversal the root initializes the debt-counter
to zero and sends the process to itself.

Lemma  4.1:  The  debt-counter  returns  to  zero  when  and  only  when  all  links  are 
traversed. 

Proof: Clearly, for each newly visited node the process increments the counter by
one. Similarly, for each visited node whose outgoing links have all been traversed,
the process decrements the counter by one. Hence, when all links are traversed, the
counter has been incremented and decremented the same number of times and is
therefore back to its initial value. This proves the sufficient condition ("when").

To prove the "only when", assume that the counter has returned to zero but
not all the links in the network are traversed. Since the network is strongly
connected there must be an untraversed link l whose tail node, v, was visited. Hence,
the counter was incremented at least once more than it was decremented, which leads
to a contradiction. ■

Using  Lemma  4.1,  the  process  may  traverse  a  unidirectional  network  and  tell 
whether  or  not  it  has  traversed  all  the  links.  However,  to  guarantee  termination,  the 
process  needs  to  employ  a  routing  rule.  The  routing  rule  will  route  the  process  to 
any  remaining  untraversed  link.  We  present  different  routing  rules,  which  lead  to 
different  traversal  algorithms,  presented  in  this  and  the  next  section. 

The routing rule used in Traversal-1 is the following: Every node orders its outgoing
links cyclically, i.e., the first link in the order follows the last one. Each time that
the process arrives at a node it is sent out on the next outgoing link according to the
cyclical order.

Lemma  4.2:  Using  the  above  routing  rule  the  process  eventually  traverses  all  the 
links. 

Proof: Assume the contrary. Then, since the network is strongly connected, there
must exist an untraversed link, l, leaving a visited node, v, whose out-degree is d.
Since v was visited once, and since the traversal cannot terminate, v will be visited
infinitely many times. Hence, following the above rule, l must be selected by the
d-th visit to v. Contradiction. ■


Figure 4.1: An example for the exponential complexity of Traversal-1


Lemmas 4.1 and 4.2 result in a correct traversal algorithm. Lemma 4.1
provides for termination detection, while Lemma 4.2 enables us to route the process
in a way which guarantees termination. However, the communication complexity of
the resulting algorithm is exponential, as will be argued next.


Figure 4.1 is an example of a network on which the algorithm requires 2^n
link traversals. Let N_l be the number of traversals over link l in some execution of
the algorithm. Clearly, N_0 = 2·N_1L = 2·N_1R and N_iL = 2·N_(i+1)L = 2·N_(i+1)R for
i = 1, ..., n-1. Since N_nL and N_nR are both equal to 1 by the end of the traversal,
N_0 = 2^n.


Response to receiving the process at node v:

    if v is unmarked
    then begin
        mark v
        increment the Debt-Counter on the process by 1
    end
    else if Debt-Counter = 0 then stop

    let l be the next link in the cyclic order of v

    if l is unmarked
    then begin
        mark l
        if l was the last unmarked link of v
        then decrement the Debt-Counter by 1
    end

    send the process over l

Figure 4.2: Traversal-1


The source of the traversal's inefficiency is the routing procedure. In the
example of figure 4.1 the process hops over an exponential number of already
traversed links before traversing the last untraversed link. A different routing rule is
employed in the next section to derive a traversal which requires O(n·|E|) hops.


A  semi-formal  description  of  the  algorithm  is  given  in  figure  4.2.  Initially  all 
nodes  and  links  are  assumed  to  be  unmarked.  To  start  the  algorithm  the  root  initiates 
a  process  with  a  debt  counter  set  to  zero  and  sends  the  process  to  itself  (i.e.,  it  places 
the  process  in  its  input  queue). 
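The steps of figure 4.2 can be traced with a short simulation. This is a minimal sketch, assuming the network is a strongly connected directed graph given as a dictionary that lists, for each node, the heads of its outgoing links in the node's cyclic order. The function name `traversal_1` and its variables are hypothetical; only the debt-counter and the cyclic routing rule come from the text.

```python
def traversal_1(out_links, root):
    """Simulate Traversal-1: debt-counter plus cyclic routing.

    out_links maps a node to the list of heads of its outgoing links, in the
    node's cyclic order (assumed: strongly connected network).
    Returns (hops, set of marked links as (tail, position) pairs).
    """
    marked_nodes = set()
    marked_links = set()
    next_pos = {v: 0 for v in out_links}   # position in each cyclic order
    debt = 0
    hops = 0
    v = root                               # the root sends the process to itself
    while True:
        if v not in marked_nodes:
            marked_nodes.add(v)
            debt += 1
        elif debt == 0:
            return hops, marked_links      # termination detected
        pos = next_pos[v]
        next_pos[v] = (pos + 1) % len(out_links[v])
        if (v, pos) not in marked_links:
            marked_links.add((v, pos))
            if all((v, j) in marked_links for j in range(len(out_links[v]))):
                debt -= 1                  # last unmarked link of v
        v = out_links[v][pos]              # send the process over the link
        hops += 1


# Three nodes, four links: 0->1, 1->2, 1->0, 2->0.
hops, links = traversal_1({0: [1], 1: [2, 0], 2: [0]}, 0)
print(hops, len(links))
```

On this small network the process stops after five hops with all four links marked. The extra hop over the already traversed link 0->1 is the kind of repeated traversal that the cyclic routing rule forces.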


4.3.  Traversal-2:  Simulating  Directed  Depth  First  Traversal 


In Traversal-1, each time the process arrives at a node, we change the link
through which the process leaves the node. In the following algorithm we modify
the routing strategy to use the same link to leave a node as long as the set of selected
links does not contain a cycle. Upon detecting a cycle, we select another untraversed
outgoing link only at the node which was explored last.

We present the algorithm of Traversal-2 in three stages. First, a bidirectional
depth first traversal algorithm is described. Second, a unidirectional implementation
of the first algorithm is presented by assuming that a structure, called a spanning
in-directed tree, is predefined. Finally, a mechanism to build the in-directed tree on
the fly is given, thus providing a unidirectional traversal algorithm.

4.3.1.  Bidirectional  directed  depth  first  traversal 

In a bidirectional directed network every link can be used to pass messages in
both directions; however, an arbitrary direction is associated with it. Here we assume
that the directed graph resulting from the directions associated with the links is
strongly connected (i.e., there is a directed path from every node to every other node
in the network).

In the bidirectional directed depth first search algorithm [Hop73], the root
spawns a process which visits all the nodes in the network. Upon arriving at node v
for the first time, say through link l, the process marks v as active, l as the father
link of v, and iteratively traverses each of v's incident out-going links. If the
process arrives at an already marked node, it backtracks to the node from which it came.
After backtracking on all of v's incident out-going links, the process marks v as
fully-backtracked and backtracks from v on the incoming link, l. The traversal is
completed when the root is marked fully-backtracked.

A formal description of the algorithm is given in figure 4.3. Note that the
process passed between the nodes is merely a token. It does not carry any
information except its actual location. Unlike this traversal, in the next sections we will use
the process to carry essential information between the nodes.


Initially all nodes and links are unmarked.

To start, the root s performs:

    mark s active ;
    select a link, l', outgoing from s ;
    mark l' active ;
    send the process over l' ;

Response to receiving the process at node v over incoming link l:

    if v is marked
    then
        send the process back over l ;
    else
        mark v active ;
        mark l father ;
        select a link, l', outgoing from v ;
        mark l' active ;
        send the process over l' ;
    end

Response to receiving the process at node v over outgoing link l:

    mark l backtracked ;
    if there is an unmarked outgoing link l'
    then
        mark l' active ;
        send the process over l' ;
    else
        mark v fully-backtracked ;
        if there is no father link
        then stop ;
        else send the process over the father incoming link ;
    end


Figure 4.3: The bidirectional directed depth first traversal algorithm


The following two observations are used in the next section to implement the
unidirectional algorithm. Let the father node of every node v, except the root, be
the node from which the process arrived at v for the first time. At any given time,
the link through which the traversing process left an active node for the last time is
called the active link.


Observation 1: The active nodes together with the active links form a
simple directed path, called the active path. The first node on the active
path is the root and the last link in the path either closes a cycle of active
links (see figure 4.4), or leads to a fully-backtracked node. The most
recently marked node among the active nodes is called the focal point of
the traversal (e.g., see figure 4.4).

Observation  2:  All  backtrackings  are  over  the  last  link  of  the  active  path, 
i.e.,  either  from  an  active  node,  or  from  a  fully-backtracked  node  to  the 
last  active  node  on  the  active  path. 

Observation  1  follows  inductively  from  the  fact  that  every  node  has  at  most 
one  active  out-going  link,  and  if  it  has  one  it  must  have  an  active  incoming  link  (the 
father  link).  Observation  2  follows  immediately  from  observation  1  and  the  algo¬ 
rithm. 


For the sake of completeness a formal proof of the algorithm is included.
The proof is a simple modification of a proof given in [Eve79] for the centralized
undirected DFS algorithm. Let a forward traversal of a link be a
traversal in the direction associated with the link, and likewise, a backward traversal
of a link be a traversal in the opposite direction. First, we shall prove that no link is
traversed more than once in each direction, and then that if the underlying graph is
strongly connected, every link is traversed in each direction (similar to [Eve79]).


Figure 4.4: The active path

Lemma 4.3: The bidirectional directed depth first process traverses every link in the
network at most once in each direction.

Proof: By the algorithm definition the process is never sent forward more than once
on the same link. Similarly, a non-father link is traversed backward once for each
forward traversal. Hence, only the traversal of a father link in the backward
direction still needs to be proved. Assume that link l, directed from v to u, is the first
father link to be traversed backward twice. Since the father link is traversed
backward only after the process has backtracked from another node to u and when all the
out-going links of u are marked, the process must have backtracked to u twice on
some link. However, it could neither backtrack twice on a non-father link (as argued
above) nor backtrack twice on any father link (by the assumption), a contradiction.
Hence, every link is traversed at most once in each direction. ■

Corollary  4.1:  The  depth  first  traversal  must  terminate. 

Lemma  4.4:  The  bidirectional  depth  first  process  traverses  every  link  in  the  network 
once  in  each  direction. 

Proof:  Since  the  network  is  strongly  connected  it  is  enough  to  prove  the  following 
claim:  For  every  node  all  its  incident  out-going  links  are  traversed  once  in  each 
direction. 

The number of times node v is entered via its out-going and father links is
equal to the number of times v is left via these links. This is because whenever the
process arrives at a marked node v through an incoming link, it is sent back on
this incoming link; otherwise, v never sends the process over an incoming link
(except the father link, through which it was also entered). Since the algorithm
terminates, all the out-going links incident to the root, s, are marked. Obviously no
link can be traversed backward before it has been traversed forward, thus all the
incident out-going links of s were traversed forward. As the number of times s was
left through an out-going link equals the number of times that s was entered on an
out-going link, and no out-going link could be entered twice (by lemma 4.3), the
claim holds for s.

Assume the claim does not hold for all the nodes. Let S be the set of nodes
for which the claim holds (see figure 4.5). Since the network is strongly connected
and s ∈ S, there must be a link l from v to u such that l is the father link of u and
such that v ∈ S and u ∈ V-S. Hence u was backtracked but not all of its incident
out-going links were traversed once in each direction. Since u's father link was
backtracked, all of u's out-going links must have been traversed forward. But the
number of times the out-going links are traversed forward equals the number of
times they are traversed backward; thus, by lemma 4.3, u ∈ S, a contradiction. ■

Figure 4.5

Corollary 4.2: The number of hops made by a traversing process in the bidirectional
depth first search is exactly 2·|E|.
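Corollary 4.2 can be checked on small networks with a centralized simulation of figure 4.3. The sketch below is illustrative only. It assumes a strongly connected directed graph given as a dictionary of successor lists, and the names `directed_dfs_hops`, `state` and `unused` are hypothetical.

```python
def directed_dfs_hops(out_links, root):
    """Simulate the bidirectional directed depth first traversal.

    out_links maps a node to the heads of its out-going links.  The process
    moves forward along link directions and backtracks against them.
    Returns (hops, final state of every visited node).
    """
    state = {root: "active"}
    # unmarked out-going links of each node, consumed in the listed order
    unused = {v: list(reversed(succ)) for v, succ in out_links.items()}
    path = [root]                 # the active path, root first
    hops = 0
    while path:
        v = path[-1]
        if unused[v]:
            u = unused[v].pop()
            hops += 1             # forward traversal of the link v -> u
            if u in state:
                hops += 1         # u is already marked: sent back to v
            else:
                state[u] = "active"
                path.append(u)
        else:
            state[v] = "fully-backtracked"
            path.pop()
            if path:
                hops += 1         # backtrack over the father link of v
    return hops, state


# Three nodes, four links: 0->1, 1->2, 1->0, 2->0.
hops, state = directed_dfs_hops({0: [1], 1: [2, 0], 2: [0]}, 0)
print(hops, state)
```

For this network with |E| = 4 the simulation makes 8 hops, and every node ends in the fully-backtracked state.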

4.3.2.  Unidirectional  depth  first  traversal,  using  a  spanning  in-directed  tree 

In this section we use the two observations of the previous section to
implement the bidirectional depth first traversal on a unidirectional network in which an
in-directed spanning tree is defined. The traversal requires, in the worst case, n·|E|
hops (messages) of the traversing process.


An  in-directed  tree  (or,  in-tree)  is  a  subnetwork  in  which  every  node,  except 
one  node,  called  the  root,  has  exactly  one  out-going  link  and  the  underlying 
undirected  graph  is  a  tree.  Since  every  node  in  the  in-tree  has  exactly  one  outgoing 
link,  there  is  a  unique  path  from  every  node  in  the  in-tree  to  the  root.  An  in-directed 
spanning  tree  is  an  in-tree  which  spans  the  network. 
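The definition can be made concrete with a small check. The sketch below assumes an in-tree is represented as a dictionary that maps every non-root node to the head of its single out-going in-tree link. The names `path_to_root` and `is_in_tree` are hypothetical helpers, not part of the original algorithms.

```python
def path_to_root(intree, v):
    """Follow the unique out-going in-tree link of each node, starting at v."""
    path = [v]
    while path[-1] in intree:          # only the root has no in-tree link
        path.append(intree[path[-1]])
        if len(path) > len(intree) + 1:
            raise ValueError("the links contain a cycle, so this is not an in-tree")
    return path


def is_in_tree(intree, root):
    """True if every node in the map reaches root along its in-tree links."""
    if root in intree:
        return False
    try:
        return all(path_to_root(intree, v)[-1] == root for v in intree)
    except ValueError:
        return False


tree = {1: 2, 2: 0}                    # links 1->2 and 2->0, rooted at node 0
print(path_to_root(tree, 1))           # the unique path from node 1 to the root
print(is_in_tree(tree, 0))
```

Because every node stores one out-going link only, the representation needs a single pointer per node, which is what makes the in-tree usable for backtracking in the traversal that follows.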

The difficulty in implementing the depth first traversal on a unidirectional
network is that the traversing process cannot backtrack on any link. To overcome
this difficulty, we note that whenever the process wants to backtrack over link l, a
directed cycle, called the backtracking cycle, is defined by concatenating: l, the
unique path in the in-tree from the head node of l to the root, and the active path.
Thus, to backtrack over link l (from the head node of l to its tail node) the process
goes along the backtracking cycle until it arrives at the tail node of l. To this end,
the unique id of each node is used by the process to identify the tail node of l.
Note that shortcuts are possible if the unique path of the in-tree intersects the active
path before reaching the root (i.e., whenever the cycle is not simple). In particular, if
the head node of l is active the process needs to follow only the active path in order
to backtrack to l's tail node.

A formal description of the traversal algorithm is given in figure 4.6,
ignoring the lines marked with *. The lines marked "I" are the code that the initiator has
to execute in order to start the traversal.

To implement the backtracking mechanism, whenever the traversing process
traverses link l from node v to node u it carries the id of v. If u is unmarked
(unvisited yet) then node u remembers that v is its father. If node u is already
marked, the process follows the cycle until it arrives back at v. When node u
becomes fully-backtracked the traversing process is sent along the backtracking
cycle to u's father, v.


Lemma 4.5: The number of hops made by a traversing process in the unidirectional
depth first traversal is at most n·|E|.

Proof: In Lemma 4.4 we saw that every link is backtracked exactly once. The
lemma follows since in each such backtracking the process goes around a cycle of
length at most n. ■
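The bound of Lemma 4.5 can be observed on a small example. The sketch below simulates the traversal with a predefined in-tree, following figure 4.6 without the lines marked *. Several details are assumptions of the sketch and are not taken from the text: the in-tree is built by a breadth first search over the reversed links, the network is a strongly connected directed graph without self-loops, and all function and variable names are hypothetical.

```python
from collections import deque


def build_in_tree(out_links, root):
    """Assumed helper: BFS over reversed links gives an in-tree rooted at root."""
    rev = {v: [] for v in out_links}
    for v, succs in out_links.items():
        for u in succs:
            rev[u].append(v)
    intree, seen, queue = {}, {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in rev[u]:
            if v not in seen:
                seen.add(v)
                intree[v] = u          # v's in-tree link leads to u
                queue.append(v)
    return intree


def unidirectional_dfs(out_links, root):
    """Return (hops, number of links never traversed)."""
    intree = build_in_tree(out_links, root)
    unused = {v: list(reversed(s)) for v, s in out_links.items()}
    status, father, active = {}, {}, {}
    mode, prev, focal = "forward", None, None
    v, hops = root, 0
    while True:
        st = status.get(v)
        if st is None:                               # unvisited node
            father[v] = prev
            active[v] = unused[v].pop()
            status[v] = "active"
            prev = v
            nxt = active[v]
        elif st == "active" and mode == "forward":   # the link closed a cycle
            mode, focal, prev = "backtrack", prev, v
            nxt = active[v]
        elif st == "active":                         # backtracking on the active path
            if v == focal:
                if unused[v]:
                    active[v] = unused[v].pop()
                    mode = "forward"
                else:
                    status[v] = "done"               # fully-backtracked
                    if father[v] is None:
                        left = sum(len(u) for u in unused.values())
                        return hops, left
                    focal = father[v]
            prev = v
            nxt = active[v] if status[v] == "active" else intree[v]
        else:                                        # fully-backtracked node
            if mode == "forward":
                mode, focal = "backtrack", prev
            nxt = intree[v]
        v = nxt
        hops += 1


net = {0: [1], 1: [2, 0], 2: [0]}                    # n = 3, |E| = 4
print(unidirectional_dfs(net, 0))
```

For this network the simulated process traverses all four links and its hop count stays below n·|E| = 12.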


In Section 4.5 it will be shown that Ω(n·|E|) is also a lower bound on the
number of hops.

The communication cost of the traversal has two components: one is due to
the hops that the process makes in the forward mode, and the other is due to the hops
that the process makes in the backtrack mode. The process backtracks over |E|
links. Each backtracking requires, in the worst case, the traversal of an O(n) long
cycle. In each hop that the process makes it carries O(log n) bits which are used to
identify the node which it wants to reach. The forward hops incur O(|E|·log n)
bits, while the backtracking hops add O(n·|E|·log n) bits to the communication
complexity of the traversal (the number of bits transmitted by the algorithm).

4.3.3.  On the fly in-tree construction

In  this  section  the  assumption  of  the  previous  subsection,  that  an  in-tree  is 
predefined  on  the  network,  is  relaxed.  The  assumption  is  relaxed  by  constructing  the 
in-tree  on  the  fly,  while  the  process  is  traversing  the  network. 

The essential use of the in-tree in the previous section was to backtrack from
a fully-backtracked node. Backtrackings from active nodes could be accomplished
by using only the active path. Thus, every node decides on its unique out-going link
in the in-tree, called intree, while it is active. While active, a node might change the
intree mark a few times. The intree link of a fully-backtracked node does not change.


The basic idea of the in-tree construction is as follows: Every node
remembers whether or not its father incoming link has already participated in a
backtracking cycle. When the father incoming link of node v participates in a
backtracking cycle for the first time, all the active nodes from v to the end of the active path
select their present active link as their in-tree link. In the rest of this chapter a father
link which has never participated in a backtracking cycle is called a bridge.

Aside from the in-tree construction, the traversal is the same as the depth first
traversal of the previous subsection. It is assumed here that whenever a shortcut in
the backtracking cycle is possible it is taken, i.e., the backtracking cycle is a simple
directed cycle.

The mechanism to construct the in-tree can be viewed as an approximation of
the mechanism to determine the low-points of vertices in a directed graph, which
was introduced by Hopcroft and Tarjan [Hop73] in their algorithm for strongly
connected components.

A formal description of the traversal algorithm, with the in-tree construction,
is given in figure 4.6. The lines of the algorithm are marked *, + and I, to indicate
the following: the * lines implement the in-tree construction, i.e., by taking out the
* lines one obtains the unidirectional traversal algorithm given an in-tree. The "I"
lines are the steps which the root executes in order to start the algorithm. The +
marks will be explained in the next subsection. Basically they indicate the lines in
which a variable of length O(log n) bits is used.


In this algorithm node v knows which of its incoming links is the father link
by recording the id of the node on the other side of the father link; this is the father
node of v. The first time that the father incoming link of node v participates in a
backtracking cycle is easily detected, as it is exactly the second time that the process
arrives at v through this link. To this end, a boolean variable, called BrgHd (Bridge
Head), is used at every node to indicate whether or not its father incoming link is a
bridge (i.e., whether it has not yet participated in a backtracking cycle). Another boolean
variable, called XBrdg (crossed bridge), is used on the traversing process to indicate
whether or not a bridge is participating in the backtracking cycle. Whenever the
process arrives at an active node v, in the backtracking mode, and the XBrdg indicator
is on, v selects its active link to be its intree link. Aside from BrgHd, every node v
has the following fields: id, which is the id of v; father, which is the id of the father
node of v; activelink, which points to the active link of v; and intree, which points to
the intree outgoing link of v. Aside from XBrdg, the traversing process, P, has the
following fields: mode, which indicates whether the process is now backtracking or
not; PreviousNode, which is the id of the node that P visited last; and FocalPoint,
which is used in the backtracking mode and is the id of the backtracking destination
node.
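The fields described above can be exercised in a centralized simulation of the traversal with the on the fly in-tree construction. This is a sketch under stated assumptions, not a transcription of figure 4.6. In particular, two choices are assumptions of the sketch: the bridge test is applied at every active node reached in the backtracking mode, including the focal point, and a node which detects that its father link is on a backtracking cycle for the first time also selects its own active link, as suggested by the phrase "all the active nodes from v to the end of the active path". The network is assumed to be a strongly connected directed graph without self-loops, and the function name `traversal_2` is hypothetical.

```python
def traversal_2(out_links, root):
    """Simulate Traversal-2; returns (hops, intree links of the nodes)."""
    unused = {v: list(reversed(s)) for v, s in out_links.items()}
    status, father, active, intree, brg_hd = {}, {}, {}, {}, {}
    mode, prev, focal, x_brdg = "forward", None, None, False
    v, hops = root, 0
    while True:
        st = status.get(v)
        if st is None:                               # unvisited node
            father[v] = prev
            brg_hd[v] = True                         # father link not yet on a cycle
            active[v] = unused[v].pop()
            status[v] = "active"
            prev = v
            nxt = active[v]
        elif st == "active" and mode == "forward":   # start of a backtracking
            mode, focal, prev = "backtrack", prev, v
            nxt = active[v]
        elif st == "active":                         # backtrack mode, active path
            if brg_hd[v] and prev == father[v]:
                x_brdg = True                        # a bridge is on this cycle
                brg_hd[v] = False
            if x_brdg:
                intree[v] = active[v]                # select the present active link
            if v == focal:                           # destination of the backtracking
                x_brdg = False
                if unused[v]:
                    active[v] = unused[v].pop()
                    mode = "forward"
                else:
                    status[v] = "done"               # fully-backtracked
                    if father[v] is None:
                        return hops, intree          # v is the root
                    focal = father[v]
            prev = v
            nxt = active[v] if status[v] == "active" else intree[v]
        else:                                        # fully-backtracked node
            if mode == "forward":
                mode, focal = "backtrack", prev
            nxt = intree[v]
        v = nxt
        hops += 1


net = {0: [1], 1: [2, 0], 2: [0]}                    # n = 3, |E| = 4
hops, intree = traversal_2(net, 0)
print(hops, intree)
```

On this network the simulation ends with the intree links 1->2 and 2->0, which form an in-directed tree rooted at the initiator, and the number of hops does not exceed n·|E| = 12.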

It remains to prove that the intree links selected by the fully-backtracked nodes
span all these nodes and always lead to an active node.

Let  us  define  an  in-directed  forest  as  a  collection  of  disjoint  in-trees. 

Lemma  4.6:  The  intree  marked  links  of  the  fully-backtracked  nodes  constitute  an 
in-directed  forest. 

Before proving the lemma we note the following two implications of its
proposition. First, if there are still active nodes, then the roots of the in-trees in the
forest must be active nodes, which is exactly what we need for the backtracking process.
This is because all the fully-backtracked nodes have an intree out-going link which
obviously cannot lead to an unvisited node. Second, the proposition implies that
when the algorithm terminates the intree links constitute an in-directed tree rooted at
the traversal initiator, the root.

Response to receiving the traversal process, P, at node v:

If v is an unvisited node:
+       v.father <- P.PreviousNode ;
*       v.BrgHd <- true ;  {BrgHd since v's father link has not yet been on a cycle}
I       v.activelink <- any unused out-going link ;
I       mark v active ;
I+      P.PreviousNode <- v.id ;
I       send P over v.activelink ;

If v is an Active node and P is in the Forward mode:
        P.mode <- backtrack ;
+       P.FocalPoint <- P.PreviousNode ;
+       P.PreviousNode <- v.id ;
        send P over v.activelink ;

If v is an Active node and P is in the backtrack mode:
*       if P.XBrdg then v.intree <- v.activelink ;
+       if v.id = P.FocalPoint
        then begin  { v is the destination of the backtracking }
*           P.XBrdg <- false ;
            if there are unused out-going links
            then begin
                v.activelink <- any unused out-going link ;
                P.mode <- Forward
            end
            else begin  { No more unused out-going links: }
                mark v fully-backtracked ;
                if v has no father then STOP ;  { v is the root }
+               P.FocalPoint <- v.father ;
            end
        end
*       else  { In the middle of backtracking, on the Active path }
*           if (v.BrgHd) and (P.PreviousNode = v.father)
*           then begin  { 1-st time that v's father link is on a backtracking cycle }
*               P.XBrdg <- true ;
*               v.BrgHd <- false ;
*           end
+       P.PreviousNode <- v.id ;
        if v is still marked active
        then send P over v.activelink ;
        else send P over v.intree ;

If v is a Fully-Backtracked node:
        if P.mode = Forward
        then begin
            P.mode <- backtrack ;
+           P.FocalPoint <- P.PreviousNode ;
        end ;
        send P over the intree link ;

Figure 4.6: Traversal-2, the unidirectional depth first traversal algorithm

Proof  of  lemma  4.6:  The  claim  will  be  proved  by  induction.  Clearly,  the  lemma 
holds  when  the  algorithm  starts  at  a  time  when  no  node  is  fully-backtracked. 
Assume  that  the  claim  holds  just  before  node  v  becomes  fully-backtracked,  and  we 
will  prove  that  it  holds  after  v  becomes  fully-backtracked. 

If  v  is  the  root  then  the  claim  certainly  holds,  since  v  selects  no  in-tree  link. 
Henceforth  v  is  not  the  root,  and  when  v  becomes  fully-backtracked  there  is  at  least 
one  active  node  in  the  network. 

Assume to the contrary that after v becomes fully-backtracked the claim does
not hold. Let T and S be the sets of nodes which were explored before and after v,
respectively (see figure 4.7). Let L be the set of links which are directed from a
node in S to a node in T. Clearly, L ≠ ∅ since the network is strongly connected. By
the definition of S and T there is no traversed link from T to S except the father link
of v. Thus, all the in-trees in T are rooted at active nodes in T, and the backtracking
cycle, C_l, which is associated with each l ∈ L, passes from T to S on the father link
of v. Clearly, at least one C_l, l ∈ L, passed on a bridge. Let the u to w link, l*, be
that link in L whose associated backtracking cycle, C_l*, was the most recent to pass
over a bridge among the backtracking cycles associated with the links in L. Then,
all the links from v to w on C_l* must have been marked as intree links. By the
definition of l* none of these marks could be removed in the future. Contradiction.


Figure  4.7 

4.4.  Traversal-3:  an  algorithm  for  a  network  of  finite  automata 

In this section the assumption that every node has O(log n) bits of memory is
relaxed. Instead, every node is assumed to be a finite automaton, i.e., to have
constant size memory regardless of the network size. Since each node has a constant
number of memory bits, the traversing process has to be of constant size too. We
will show that with a constant size process the traversal requires at most
O(n·|E| + n² log n) hops, which is also the bit complexity of the algorithm.



Traversal-3 is developed in two steps: First, the bit complexity of Traversal-2
is reduced to O(n·|E| + n² log n) (from O(n·|E|·log n)), and second, an
implementation with constant size memory is presented. In the first step it is shown that
only in O(n²) out of a total of O(n·|E|) hops the traversing process has to carry
O(log n) bits (in the rest it carries O(1) bits). In the second step the O(n²) hops of
size O(log n) are replaced by O(n² log n) hops of a constant size process.

4.4.1. Reducing the communication complexity of Traversal-2 to $O(n|E| + n^2 \log n)$ bits

In this section Traversal-2 is modified so that in at most 2n-2 of its
backtrackings the process carries $O(\log n)$ bits around the backtracking cycle. In
the rest of the backtrackings the process need not carry more than a constant number
of bits. This reduces the bit complexity from $O(n|E| \log n)$ to
$O(n|E| + n^2 \log n)$. A variation of the traversal presented here is also presented
at the end of Chapter 5 in a much different setting.

The  modification  of  the  algorithm  is  as  follows: 

1.  Every  active  node  uses  a  boolean  variable,  called  the  focal  point,  to  assert 
whether  or  not  it  is  the  focal  point  of  the  traversal.  If  the  focal  point  variable 
of  node  v  is  false  then  v  is  not  the  focal  point  of  the  traversal.  When  v  is 
visited  for  the  first  time  it  sets  its  focal  point  to  true.  When  v  becomes 
fully-backtracked  it  sets  its  focal  point  to  false. 

2. Whenever the process arrives in the forward mode at a marked node, v, a two
phase backtracking is started. In the first phase the process is sent around the
backtracking cycle and back to v, counting whether there is one or more
nodes on the backtracking cycle whose focal point is true. To this end only a
constant number of bits has to be carried around by the process.


If only one node on the cycle has its focal point "on", then this node must be
the node preceding v on the cycle (see figure 4.8, case 1), i.e., it is the
backtracking destination. In this case, the process is sent around the backtracking
cycle again with a constant number of bits, to the unique node which asserts
that it is the focal point. Hence, a complete backtracking is performed with the
process carrying only a constant number of bits.


Figure 4.8: Backtracking in Traversal-3

If more than one node on the backtracking cycle has its focal point "on" then
a bridge was included in the backtracking cycle (see figure 4.8, case 2), and a
backtracking identical to the one used in Traversal-2 is initiated by v (with
XBrdg set to true). In this phase of the backtracking all the nodes, aside from
the last one on the active path, set their focal point to false. A bridge is
included since the active link leaving any node whose focal point is true,
aside from the last one, must have led to a new node (i.e., it is the father link
of the next node on the active path) and has never been on a backtracking
cycle before (otherwise the focal point would not have been true).

5. The backtrackings from a fully-backtracked node remain the same as in
Traversal-2 except that the focal point of the destination is set to true.

The  main  claim  of  this  subsection  is, 

Lemma  4.7:  By  the  above  modification  the  process  will  have  to  carry  O  (log  n )  bits 
around  a  backtracking  cycle  only  2n-2  times. 

Proof: The process has to carry $O(\log n)$ bits around the backtracking cycle either
when it backtracks from a fully-backtracked node to its father, or when the
backtracking cycle goes over some bridge for the first time. Thus we can associate
one such backtracking with each node that becomes fully-backtracked, and one with
each bridge. Clearly there are n-1 nodes which become fully-backtracked (all except
the root, from which the process never backtracks). Similarly, there are n-1 bridge
links since each such link is the unique incoming father link of some node (except
the root, which has no bridge link entering it). ■

In the remaining |E|-2n+2 backtrackings the process carries only a constant
number of bits. Since every backtracking requires at most n hops we get:

Corollary 4.3: The communication complexity of the modified Traversal-2 is
$O(n|E| + n^2 \log n)$ bits.
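The arithmetic behind Corollary 4.3 can be written out as follows. This is only a worked restatement of Lemma 4.7 and the hop bound above; it assumes that a constant-bit backtracking, even when it circles the cycle twice, still costs $O(n)$ hops.

```latex
\begin{align*}
\text{bits}
  &\le \underbrace{(2n-2)}_{\text{Lemma 4.7}} \cdot n \cdot O(\log n)
   \;+\; \underbrace{(|E| - 2n + 2)}_{\text{remaining backtrackings}} \cdot n \cdot O(1) \\
  &= O(n^2 \log n) + O(n|E|) \\
  &= O(n|E| + n^2 \log n).
\end{align*}
```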


4.4.2.  A  finite  automata  implementation  of  Traversal-2 


In  this  section  we  show  how  each  of  the  backtrackings  in  Lemma  4.7  can  be 
implemented  by  a  constant  size  process  which  will  go  around  the  backtracking  cycle 
O  (log  n )  times. 

In  each  of  these  backtrackings  a  designated  node  on  the  backtracking  cycle 
sends  the  process  to  the  node  preceding  it  on  the  cycle.  The  O  (log  n )  bits  were  used 
to  identify  the  preceding  node. 

To  recognize  the  preceding  node  on  the  backtracking  cycle  using  a  constant 
size  process,  we  use  a  solution  to  the  following  "last  in  the  ring"  puzzle:  In  a  uni¬ 
directional  ring  of  finite  automata,  design  an  algorithm  by  which  a  designated  node, 
v ,  will  distinguish  the  node  preceding  it,  u ,  from  all  other  nodes. 

Lemma 4.8: The upper and lower bound on the bit complexity of the "last in the
ring" puzzle is $\Theta(n \log n)$.

Proof:  A  solution  to  the  puzzle  works  in  phases.  Initially,  all  nodes  except  v  are 
candidates  for  the  position.  In  each  phase  we  eliminate  half  of  the  remaining  candi¬ 
dates  by  sending  a  token  around  the  ring,  alternately  marking  the  candidate  nodes 
even  and  odd.  When  the  token  arrives  at  v ,  it  remembers  the  parity  of  the  candidate 
preceding  v.  In  the  next  phase,  the  token  eliminates  all  candidates  whose  parity 
differs  from  the  parity  of  the  desired  node.  The  last  phase  is  detected  by  the  token 
when  it  sees  that  only  one  node  has  not  been  eliminated.  Thus  the  token  carries  one 
more  bit  to  indicate  whether  there  are  one  or  more  uneliminated  nodes  on  the  cycle. 
Hence, $O(n \log n)$ is an upper bound on the bit complexity of the puzzle.

To prove that this is also the lower bound we note that if all n links have
seen fewer than log n bits, then there are two distinct links, $u_0 \to u_1$ and
$v_0 \to v_1$, such that both have seen the same sequence. Hence, $u_1$ and $v_1$
are in the same state and both have generated the same sequence on their outgoing
links, which implies that their down neighbors $u_2$ and $v_2$ are also in the same
state. Continuing this argument inductively we conclude that the node preceding the
designated one has to be at an equal distance from both $u_0$ and $v_0$. This is a
contradiction; hence, $\Omega(n \log n)$ is also the lower bound. ■
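
The phase structure used for the upper bound in the proof of Lemma 4.8 can be illustrated by a small centralized simulation. This is only a sketch under stated assumptions: the ring is modeled as nodes 0..n-1 with node 0 playing the role of v, marking and elimination are done in two separate passes per phase, and the name `last_in_ring` and the hop accounting are illustrative choices that do not appear in the original text.

```python
def last_in_ring(n):
    """Simulate the phase algorithm on the ring 0 -> 1 -> ... -> n-1 -> 0.

    Node 0 plays the designated node v (an assumption of this sketch); the
    node it must single out is its predecessor, n-1. Requires n >= 2.
    Returns (surviving candidate, number of token hops).
    """
    candidate = [False] + [True] * (n - 1)  # every node except v is a candidate
    parity = [None] * n                     # one bit of local state per node
    hops = 0
    while True:
        # Marking pass: the token alternately marks candidates even (0) / odd (1).
        bit, last_parity, seen = 0, None, 0
        for node in range(1, n):
            if candidate[node]:
                parity[node] = bit
                last_parity = bit           # parity of the last candidate before v
                bit ^= 1
                seen += 1                   # a real token only needs "one" vs "more"
        hops += n
        if seen == 1:                       # a single candidate is left: done
            break
        # Elimination pass: drop candidates whose parity differs from the
        # parity remembered for the candidate preceding v.
        for node in range(1, n):
            if candidate[node] and parity[node] != last_parity:
                candidate[node] = False
        hops += n
    return candidate.index(True), hops
```

Each phase roughly halves the candidate set, so the number of passes grows logarithmically in n while every hop carries only a constant number of bits, which is the behavior the upper bound relies on.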

Thus, whenever node v becomes fully-backtracked or a backtracking cycle
with more than one focal point is closed at v, v starts the algorithm described in
the proof of Lemma 4.8 to send the process to the node preceding it. The total bit
complexity of the traversal does not change with this modification; however, the
number of hops made by the process is linear in the bit complexity of the traversal,
i.e., $O(n|E| + n^2 \log n)$.

4.5.  Lower  Bounds 

Having arrived at an upper bound of $O(n|E| + n^2 \log n)$ bits on the communication
complexity of traversal, we ask what the lower bound is. Much research is
still needed in establishing tight lower bounds on the unidirectional traversal
problem in general. In this section we present one step in this direction. We show
that $\Omega(n|E|)$ is a lower bound on the number of hops required by a single token
traversal, i.e., when the traversal is restricted to send at most one message at a
time.

Lemma 4.9: $\Omega(n|E|)$ bits is a lower bound for a single token traversal of a
unidirectional graph of arbitrary topology.


Proof: (by example, figure 4.9). The result follows since each traversal of a link
from A to B must be followed by a traversal of the path C. ■

Figure 4.9: A network for the $\Omega(n|E|)$ lower bound

Lemma 4.9 proves that our traversal algorithm is optimal for dense networks
(in which $|E| = \Omega(n \log n)$) under the restriction that the traversal has at
most one outstanding message at a time. Furthermore, it suggests that the algorithm
is optimal in the general case.

4.6.  Applications 

Traversal-2 and -3, which are different implementations of the same algorithm,
can be modified to produce a useful structure, called infrastructure, on the
network. The infrastructure is the combination of an in-directed spanning tree and
an out-directed spanning tree. The in-tree construction was detailed in section 4.3.3.
Here we will describe how an out-tree may be produced by Traversal-2. Note that
the defined infrastructure is a strongly connected subnetwork which spans the
network and has at most 2n links. The infrastructure proves to have several practical
applications.

4.6.1.  Producing  a  spanning  out-tree 

An out-directed tree (or, out-tree) is a subnetwork in which every node,
except one, called the root, has exactly one in-coming link and the underlying
undirected graph is a tree. Since every node in the out-tree other than the root has
exactly one in-coming link, there is a unique path from the root to every node in the
out-tree. An out-directed spanning tree is an out-tree which spans the network. An
example of an out-tree is given in figure 4.10.


Figure 4.10: An example of an out-tree

We now explain how every node in the network marks some of its out-going
links outtree during the traversal algorithm such that, when the algorithm
terminates, the collection of outtree marked links constitutes a spanning out-tree of
the network rooted at the traversal initiating node.

To construct an out-tree we make the following observation: The collection
of father links constitutes an out-tree. To prove it note the following: (1) Every
node, except the root, has exactly one incoming father link, and (2) going backward
on the path defined by the father links from any node v we always arrive at the root.
Note that (2) follows from the fact that the incoming father link of any node v
connects v to a node which was explored before v; thus, we cannot close a cycle and
we must arrive at the root.

To detect the outgoing father links of node v we observe the following:
The traversing process leaves node v twice or more through link l (in Traversal-2)
while v is active if and only if l is a father link. Thus, every active node records
whether or not each of its out-going links has been on a backtracking cycle more
than once. If a link participated in a backtracking cycle more than once while v is
active then it is a father link and is marked outtree.

4.6.2. Applications of the traversal algorithm

In this section we show how the traversal algorithm and its resulting
infrastructure can be used to solve other problems. In particular we apply the
traversal algorithm to perform broadcasting, to route messages, and to systematically
emulate any bidirectional algorithm on a unidirectional network. Each of the
applications can be solved by a traversal; however, after executing the traversal
once, the application problems can be solved more efficiently by using the
infrastructure produced by the traversal.


4.6.2.1. Broadcast with Echo

The  problem  of  broadcasting  with  echo  can  be  solved  on  a  unidirectional  net¬ 
work  by  a  traversal  algorithm.  In  the  broadcast  with  echo  problem  one  node,  the 
root,  has  a  piece  of  information  which  it  sends  to  all  the  nodes  in  the  network,  and 
the  root  gets  a  positive  acknowledgment  that  all  the  nodes  have  received  the  infor¬ 
mation. 

A straightforward solution to the problem uses a traversing process to
carry the information. The message complexity of this solution is the message
complexity of the traversal algorithm, $O(n|E|)$.

After one traversal, the next broadcast with echo can be more efficiently
performed by traversing only the infrastructure links. Since the infrastructure
defines a strongly connected network any node (not only the root) may start a
traversal for this purpose. The complexity of this traversal is $O(n^2)$ since the
number of links in the infrastructure is at most 2n-2.

After one traversal has been performed, a further improvement can be achieved as
follows: Every node which wants to start a broadcast with echo first sends the
information of the broadcast to the root node along the intree marked links, and
then the root node starts a broadcast with echo as described below. After receiving
the echo, the root node will broadcast an echo on the out-tree links.

The infrastructure can be used for an efficient broadcast with echo from the
root as follows: The root sends the broadcast message on all its out-going links in
the infrastructure. Upon receiving the broadcast message for the first time, every
other node sends the message to all its out-going neighbors in the infrastructure.
Any other copy of the broadcast message is discarded. This implements the broadcast
part of the algorithm. To notify the root that all the nodes in the network have
received the broadcast message, every node v sends an echo as follows: After
sending the copy of the broadcast message, v sends an echo over all of its
infrastructure outgoing links except the link marked intree. Node v sends an echo
over the intree marked outgoing link only after an echo has been received on all of
its infrastructure incoming links. When the root has received the echo over all of
its infrastructure incoming links, the notification has been completed, and a message
to this effect is sent on the outtree marked links.
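
The broadcast and echo rules above can be exercised with a small event-driven simulation. This is a minimal sketch, not the original implementation: it assumes a single global FIFO queue (so messages on each link arrive in the order sent), a simple graph given as a list of distinct infrastructure links, and a map `intree_next` from each non-root node to the head of its intree marked link. The function and parameter names are hypothetical.

```python
from collections import deque

def broadcast_with_echo(infra_links, intree_next, root):
    """Simulate broadcast with echo over the infrastructure links.

    infra_links: list of distinct directed links (u, w) of the infrastructure.
    intree_next: dict mapping every non-root node to the node at the other
                 end of its intree marked outgoing link.
    Returns (root_notified, nodes_reached, message_counts).
    """
    nodes = {x for link in infra_links for x in link}
    out_links = {v: [] for v in nodes}
    in_count = {v: 0 for v in nodes}
    for u, w in infra_links:
        out_links[u].append(w)
        in_count[w] += 1

    reached = set()
    echoes_seen = {v: 0 for v in nodes}
    sent = {"broadcast": 0, "echo": 0}
    queue = deque()

    def send(kind, u, w):
        sent[kind] += 1
        queue.append((kind, u, w))

    def first_receipt(v):
        # Forward the broadcast on every infrastructure out-link, then echo
        # on every out-link except the intree marked one.
        reached.add(v)
        for w in out_links[v]:
            send("broadcast", v, w)
        for w in out_links[v]:
            if w != intree_next.get(v):
                send("echo", v, w)

    first_receipt(root)
    root_notified = False
    while queue:
        kind, u, v = queue.popleft()
        if kind == "broadcast":
            if v not in reached:          # later copies are discarded
                first_receipt(v)
        else:
            echoes_seen[v] += 1
            if echoes_seen[v] == in_count[v]:
                if v == root:
                    root_notified = True
                else:                      # now echo on the intree link
                    send("echo", v, intree_next[v])
    return root_notified, reached, sent
```

In this sketch each infrastructure link carries exactly one broadcast message and one echo, which matches the per-link accounting used in the complexity discussion.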

The average message complexity of the resulting broadcast with echo algorithm
is 6n (averaging out the first broadcast, which is a regular traversal). Sending
the broadcast message from the initiating node to the root costs at most n-1
messages. The broadcast message is then transmitted once over each infrastructure
link, which adds at most 2n-2 messages to the complexity. The echo message is also
sent once over each infrastructure link, hence another 2n-2 messages. Then, to echo
the initiating node, another n-1 messages are transmitted on the outtree marked
links.
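
Summing the four contributions listed above gives the stated bound; the display below only restates that arithmetic.

```latex
\[
\underbrace{(n-1)}_{\text{to the root}}
+ \underbrace{(2n-2)}_{\text{broadcast}}
+ \underbrace{(2n-2)}_{\text{echo}}
+ \underbrace{(n-1)}_{\text{final notification}}
= 6n - 6 < 6n .
\]
```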

4.6.2.2. Message sending

The in- and out-trees which result from the traversal algorithm enable us to
efficiently send a message from every node to every other node. To pass a message
from node v to node u, node v sends the message along the intree marked links to
the root, which then broadcasts the message on the out-tree. Thus, at most 2n-2
messages are sent in the routing mechanism, in order to send a message from any node
to any other node.
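
The routing mechanism can be sketched as a climb along the in-tree followed by a broadcast down the out-tree. This is an illustrative sketch only: the dictionaries `intree_next` (next hop toward the root) and `outtree_children` (outtree marked links per node) and the function name are assumptions made for the example.

```python
def route_via_root(intree_next, outtree_children, source):
    """Count the messages used to deliver a message from `source` to all nodes.

    intree_next: dict mapping each non-root node to its next hop toward the root.
    outtree_children: dict mapping a node to the nodes reached by its outtree
                      marked links.
    Returns (messages up the in-tree, messages down the out-tree, nodes reached).
    """
    # Climb the in-tree: the root is the only node without an intree link.
    node, up = source, 0
    while node in intree_next:
        node = intree_next[node]
        up += 1
    root = node

    # The root broadcasts on the out-tree: one message per out-tree link.
    down, reached, stack = 0, {root}, [root]
    while stack:
        u = stack.pop()
        for w in outtree_children.get(u, []):
            reached.add(w)
            down += 1
            stack.append(w)
    return up, down, reached
```

With n nodes, the climb uses at most n-1 messages and the out-tree broadcast exactly n-1, so the total stays within the 2n-2 bound.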


4.6.2.3.  Emulating  bi-directional  distributed  algorithms 


The above routing mechanism can be used as a means to emulate any bidirectional
distributed algorithm on a strongly connected unidirectional network. Whenever
a node has to send a message on an incoming link, it will use the above message
passing mechanism. Thus, if a problem has a bit complexity $O(P(n))$ on a
bidirectional network, then its complexity on the unidirectional network is upper
bounded by $O(n P(n) + n^2 \log n + n|E|)$ (the last two terms are entailed by the
construction of the infrastructure).




ELECTION  IN  UNIDIRECTIONAL  NETWORKS 


In this chapter we present a distributed algorithm for election in strongly
connected unidirectional networks. The algorithm requires $O(\log n)$ bits of memory
in each processor and its communication complexity is $O(n|E| + n^2 \log n)$ bits.

5.1.  Introduction 

The  strongly  connected  unidirectional  network  is  the  most  general  network 
model,  in  the  sense  that  every  network  topology,  bidirectional  or  unidirectional,  can 
be  modeled  as  a  unidirectional  network  by  replacing  any  bidirectional  link  with  two 
anti-parallel  unidirectional  links.  Hence,  any  distributed  algorithm  for  strongly  con¬ 
nected  unidirectional  networks  is  also  an  algorithm  for  any  other  network  model. 

To design an election algorithm for strongly connected unidirectional networks
the traversal algorithms of Chapter 4 can be used in various ways. First, in a
straightforward approach, every initiator starts a traversal. Whenever two traversals
meet, the lower id one is destroyed. The worst case communication complexity of
this algorithm is $O((n|E| + n^2 \log n)\,n)$ bits, as $O(n)$ traversals could be
initiated such that each spends $O(n|E| + n^2 \log n)$ bits. Second, the modular
technique of Korach et al. [Kor85] could be used to economically eliminate
traversals. Using their technique the communication complexity is reduced to


No algorithm for election in unidirectional networks has come to our attention
prior to the one presented here. However, two bidirectional algorithms, the
shortest path algorithm of Gallager [Gal76] and the connectivity checking algorithm
of Segall [Seg83], can easily be modified into a unidirectional election algorithm.
In [Seg83], Segall presents a connectivity checking algorithm upon whose termination
every node knows the ids of all the other nodes connected to it. The shortest path
algorithm in [Gal76] exhibits the same property when it terminates. The
communication complexity of the two algorithms is $O(n|E| \log n)$ bits, and each
node is assumed to have $O(n \log n)$ bits of memory.

The  unidirectional  variation  of  the  two  algorithms  proceeds  in  two  phases:  In 
the  first  phase,  every  node  acquires  the  ids  of  its  incoming  neighbors;  in  the  second, 
it  acquires  the  ids  of  all  the  other  nodes  in  the  network. 

Let an incoming neighbor of node v be a node at the other end of an incoming
link of v, and let in-neighbors of v be the set of all the incoming neighbors of
v. Let the record of node v be a two-field data structure, of which the first
contains the id of v and the second the ids of v’s in-neighbors. In the first phase,
every node transmits its id on all its incident outgoing links. In the second phase,
every node broadcasts its record to all the other nodes in the network. For this
purpose, each node v maintains two sets of ids, the received set and the known set.
The received set contains the ids of the nodes whose records were already received
by v. The known set contains ids which appeared in a record of at least one node
from the received set, i.e., ids of nodes whose existence is known to v. Initially,
the received set of node v contains the id of v, and the known set contains the ids
of v and of v’s in-neighbors. Clearly, when the two sets in a node are identical,
they contain the ids of all the nodes in the network (which can easily be proved by
induction).
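
The two-phase scheme with received and known sets can be illustrated by a round-based simulation. This is a sketch under simplifying assumptions that are not in the original: the network runs in synchronous rounds, every node forwards all records it holds on all outgoing links each round, and the loop stops once every node's two sets coincide. The name `learn_all_ids` and the graph encoding (dict of successor lists) are hypothetical.

```python
def learn_all_ids(graph):
    """Simulate the two-phase id collection on a strongly connected digraph.

    graph: dict mapping each node id to the list of its outgoing neighbors.
    Returns (received set of every node, number of flooding rounds).
    """
    # Phase 1: every node sends its id on its outgoing links, so every node
    # learns the ids of its in-neighbors.
    in_nbrs = {v: set() for v in graph}
    for u, outs in graph.items():
        for w in outs:
            in_nbrs[w].add(u)

    # records[v] holds the records node v has so far: id -> in-neighbor ids.
    records = {v: {v: in_nbrs[v]} for v in graph}

    def received(v):
        return set(records[v])

    def known(v):
        ids = set(records[v])
        for nbrs in records[v].values():
            ids |= nbrs
        return ids

    # Phase 2: flood the records until received == known at every node.
    rounds = 0
    while any(received(v) != known(v) for v in graph):
        snapshot = {v: dict(records[v]) for v in graph}
        for u, outs in graph.items():
            for w in outs:
                records[w].update(snapshot[u])
        rounds += 1
    return {v: received(v) for v in graph}, rounds
```

The stopping test works because a received set that equals the known set is closed under taking in-neighbors, and in a strongly connected network the only such set containing v is the full node set.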


The communication complexity of the algorithm thus described is
$O(|E|^2 \log n)$ bits; however, assuming that messages sent over one link are
received in the order transmitted, the communication complexity can be reduced to
$O(n|E| \log n)$ bits by avoiding repeated transmission of the same id over the same
link. Note that for these algorithms each node is assumed to have $O(n \log n)$ bits
of memory.

In this chapter we present an election algorithm for general strongly connected
unidirectional networks, whose communication complexity is
$O(n|E| + n^2 \log n)$ bits, using $O(\log n)$ bits of memory in each node. The
algorithm yields two directed spanning trees, both rooted at the leader: one is an
incoming tree, and the other is an outgoing tree. The algorithm is thus an
improvement on the algorithms of Gallager and Segall both in terms of communication
complexity and in terms of the number of memory bits required at each node.
Furthermore, unlike our algorithm, neither Segall’s nor Gallager’s algorithm provides
the spanning trees.

The election algorithm presented here is a generalization of the traversal
algorithm from the previous chapter. On one hand, the unidirectional traversal
algorithm with a predefined in-tree (Section 4.3.2) is a building block of the
election algorithm. On the other hand, the traversal algorithm with the in-tree
construction (Traversal-2) can be seen as a special case of the election algorithm.
The traversal algorithm is the election algorithm, under the constraint that only one
node starts the algorithm. The resulting traversal algorithm incurs the same
communication cost as the election algorithm.


5.2.  A  Unidirectional  Election  Algorithm 

In  this  section  we  present  a  recursive  distributed  algorithm  for  election  in 
unidirectional  strongly  connected  networks. 

5.2.1.  Definitions  and  Outline 

The  election  algorithm  is  based  on  the  following  recursive  properties  of 
strongly  connected  directed  multigraphs: 

1.  The  set  of  links,  defined  by  selecting  one  outgoing  link  from  every  node, 
contains  a  nonempty  set  of  disjoint  directed  cycles. 

2.  The  subgraph,  obtained  from  G  by  contracting  any  of  the  cycles  defined 
above  into  one  node,  results  in  a  strongly  connected  multigraph. 

3. Repeated application of the operations in 1 and 2 contracts G into a single
node.
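
The three properties above can be illustrated by a centralized contraction loop. This is a sketch only and simplifies the distributed setting: it always selects an inter-cluster link for each cluster (so every cycle found has at least two clusters), labels the merged cluster with the largest id on the cycle (an arbitrary choice for the example), and assumes integer node ids. The function name is hypothetical.

```python
def contract_to_single_node(graph):
    """Repeatedly select links, find a cycle of clusters, and contract it.

    graph: dict mapping each node to the list of its outgoing neighbors;
           the graph is assumed to be strongly connected.
    Returns (number of contractions, final cluster label of every node).
    """
    cluster = {v: v for v in graph}                    # each node is a cluster
    links = [(u, w) for u, outs in graph.items() for w in outs]
    contractions = 0
    while len(set(cluster.values())) > 1:
        # Property 1: one selected outgoing link per cluster. Strong
        # connectivity guarantees that every cluster has one.
        selected = {}
        for u, w in links:
            cu, cw = cluster[u], cluster[w]
            if cu != cw and cu not in selected:
                selected[cu] = cw
        # Follow selected links until a cluster repeats: that closes a cycle.
        path, c = [], next(iter(selected))
        while c not in path:
            path.append(c)
            c = selected[c]
        cycle = path[path.index(c):]
        # Property 2: contract the cycle into a single cluster.
        head = max(cycle)
        for v in cluster:
            if cluster[v] in cycle:
                cluster[v] = head
        contractions += 1
    return contractions, cluster
```

Every contraction merges at least two clusters, so at most n-1 contractions are needed before a single cluster remains, in line with property 3.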

The distributed algorithm proceeds in conceptual phases, which follow the
above contraction process. When a cycle is detected, its nodes are grouped into a
cluster. Similar phases are used in [Hum83]. Initially we consider each node to be a
single node cluster. The phases of the algorithm are: selection of an outgoing link
from each cluster, called a selected link; detection of cycles among clusters; and
contraction of cycles of clusters.

A  cluster  is  recursively  defined  as  follows: 

1.  A  single  node  is  a  cluster. 

2. A set of clusters that are joined in a ring by their selected outgoing links is a
cluster.

Recursively  we  assume  that  every  cluster  satisfies  the  following  4  properties  (see 
figure  5.1): 

1. A unique node in the cluster is distinguished as the cluster head.

2.  All  the  nodes  in  the  cluster  know  the  id  of  the  cluster  head  which  is  also  the 
id  of  the  cluster. 

3.  Each  node  in  the  cluster,  except  the  cluster-head,  has  one  outgoing  link 
marked  as  intree  link.  The  collection  of  intree  links  forms  a  directed  incom¬ 
ing  tree,  spanning  the  cluster  and  rooted  at  the  cluster-head. 

4.  A  strongly  connected  subnetwork  which  spans  the  cluster,  called  the  infras¬ 
tructure  of  the  cluster,  is  defined  on  the  cluster. 

Clearly,  a  single  node  cluster  satisfies  the  inductive  assumptions.  It  is:  the 
cluster-head  of  itself;  a  single  node  in-tree;  and  a  strongly  connected  subnetwork. 
To  describe  the  algorithm  we  will  describe  the  inductive  step,  i.e.,  we  assume  that  a 
set  of  clusters  which  satisfy  the  assumptions  already  exists  and  explain  how  a  bigger 
cluster  which  satisfies  the  inductive  assumptions  is  composed  out  of  this  set. 

To  select  a  cluster  outgoing  link,  each  cluster  head  initiates  a  Depth  First 
Traversal  (DFT)  process.  The  traversal  process  is  used  to  search  for  a  link  which  is 
potentially  outgoing  from  the  cluster.  For  the  depth  first  traversal  algorithm  we  use 
the  traversal  algorithm  which  was  developed  in  section  4.3.2. 

To detect a cycle, we use a simple algorithm for election on a unidirectional
ring. Each cluster forwards the largest id it has seen. When a cluster receives the
same id twice it has detected a cycle.

The contraction of a cycle is accomplished by first electing one of the
cluster-heads on the cycle to be the cluster-head of the expanded cluster. Then, the
newly elected cluster-head synchronizes the cluster by broadcasting the new
cluster-id to all the nodes and constructing all the inductive requirements on the
new cluster.

The most costly phase is the DFT in a search for a cluster outgoing link. The
reason for this is that in the contraction phase we lose the DFT information
accumulated by all the clusters around the cycle except one. When the cluster-head
initiates a DFT in the next phase, the search will have to spend much effort
regaining all the lost DFT information. However, by selecting the cluster-head of the
largest cluster on the ring to be the new cluster-head, we minimize the amount of
information lost. Thus, we limit the rate of information loss to the rate of cluster
growth (i.e., if a large cluster were contracted with a small one, the amount of
information lost is proportional to the size of the small one). In fact, we are able
to show that the cost of all the DFT’s conducted during the algorithm is within a
constant factor of the cost of a single DFT. Since this point is critical in the
complexity calculation, but rather minor to the description of the algorithm, we
postpone a detailed discussion of it until the complexity section.

After a cluster is formed, its nodes are synchronized to search for an
untraversed link outgoing from the cluster. To achieve this synchronization, the
in-tree rooted at the cluster-head is used. When a cycle of clusters is contracted
into a bigger cluster, all their in-trees are merged into one in-tree, spanning the
new cluster. The operations of merging in-trees and searching clusters utilize each
other alternately. The structures left by the DFT’s are used to modify and merge the
separate in-trees around the cycle into one in-tree. In turn, the in-tree in a
cluster is used for routing purposes by its DFT process (see section 4.3.2).

In  the  next  three  subsections  we  present  the  three  phases  of  the  algorithm 
starting  with  the  cluster  outgoing  link  selection  (see  figure  5.1).  During  the  algo¬ 
rithm,  links  are  in  one  of  three  states:  new,  elementary  or  killed.  A  new  outgoing 
link  is  one  which  has  not  yet  been  traversed.  An  elementary  link  is  a  link  which  was 
a  cluster  selected  outgoing  link  during  one  of  the  previous  stages.  The  set  of  ele¬ 
mentary  links  within  one  cluster  forms  the  infrastructure  of  the  cluster.  A  killed  link 
is  a  nonelementary  link  already  traversed  during  the  algorithm  (i.e.,  an  intra  cluster 
nonelementary  link). 

5.2.2. Selection of a Cluster-Outgoing Link

Once  a  cluster  is  formed,  its  head  node  initiates  a  DFT  algorithm  to  search 
the  cluster’s  infrastructure  for  a  node  with  an  untraversed  outgoing  link.  The  first 
such  link  to  be  found  is  selected  as  the  cluster’s  outgoing  link.  If  it  turns  out  to  be 
an  intra  cluster  link,  the  DFT  continues  where  it  was  stopped  in  the  search  of  another 
untraversed  outgoing  link.  If  no  cluster  outgoing  link  is  found,  the  cluster  contains 
all  the  nodes  of  the  network,  and  the  algorithm  terminates. 

For  the  completeness  of  the  algorithm  description  we  review  the  essential 
details  of  the  DFT  from  section  4.3.2. 

5.2.2.1. Distributed Depth First Traversal of Unidirectional Networks

A building block of the election algorithm is the distributed Depth First
Traversal (DFT) of unidirectional networks in which an intree is defined. The root
node of the intree initiates the DFT by spawning a process which visits all the nodes
in the network. Upon arriving at a node for the first time, the process marks the
node and recursively spawns one new child process, which sequentially visits each
of the node’s outgoing neighbors. If a process arrives at an already marked node, it
backtracks to its father. After its child process has backtracked from all the
outgoing neighbors, it is killed, and the process backtracks to its father. The
traversal terminates when the child process of the root node is killed.

Figure 5.1: Clusters in the Election Algorithm


To  perform  the  backtracking  in  a  unidirectional  network,  we  use  the  given 
intree  and  note  two  observations.  Nodes  on  which  live  processes  are  located  form  a 
simple  directed  path.  In  the  sequel  we  call  this  path  an  active  path  and  its  links 
active  links.  The  active  outgoing  link  of  each  process  leads  to  its  child  process. 


The  first  node  on  the  active  path  is  the  root  of  the  intree.  All  backtrackings  are  from 
the last node on the active path to its father. Thus, to accomplish backtracking, a
process follows the unique path of the predefined intree, from its location to the root.
From  the  root,  the  process  follows  the  active  path  to  its  father  (whose  identity  each 
process  remembers). 

Each backtracking performed requires at most 2n messages of log n bits
each. Since there are |E| backtrackings, the total communication cost of the DFT is
O(n · |E| · log n) bits.
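
The backtracking count can be checked on a small example. The following is a minimal sequential sketch, not the distributed protocol itself: it simulates the token on one machine and records how many hops each backtracking needs (in-tree path to the root, then active path to the target). The function name, the dictionary encoding of the network and the four-node example are assumptions made for illustration.

```python
def dft_backtrack_hops(out_edges, parent, root):
    """Simulate the DFT token and return (marked nodes, hops per backtracking).

    out_edges: node -> list of outgoing neighbors (unidirectional links)
    parent:    node -> next node on the in-tree path toward the root
    """
    def in_tree_hops(v):
        # Number of links on the unique in-tree path from v to the root.
        hops = 0
        while v != root:
            v = parent[v]
            hops += 1
        return hops

    marked = {root}
    active_path = [root]                  # last entry is the focal point
    next_link = {u: 0 for u in out_edges}
    backtracks = []
    while active_path:
        u = active_path[-1]
        if next_link[u] < len(out_edges[u]):
            v = out_edges[u][next_link[u]]
            next_link[u] += 1
            if v in marked:
                # Return to u: in-tree from v to the root, then the active path.
                backtracks.append(in_tree_hops(v) + len(active_path) - 1)
            else:
                marked.add(v)
                active_path.append(v)
        else:
            active_path.pop()
            if active_path:
                # Process at u is killed; go back to its father via the root.
                backtracks.append(in_tree_hops(u) + len(active_path) - 1)
    return marked, backtracks


# Hypothetical strongly connected network with an in-tree rooted at node 0.
edges = {0: [1], 1: [2], 2: [0, 3], 3: [0]}
in_tree = {1: 2, 2: 0, 3: 0}
visited, hops = dft_backtrack_hops(edges, in_tree, root=0)
```

In this example there is one backtracking per link, and no backtracking uses more than 2n hops, in line with the bound above.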

Note  that  at  any  given  time,  all  but  one  of  the  live  processes  are  waiting  for 
their  child  processes  to  terminate.  The  last  process,  which  has  no  child  and  is 
actively  visiting  nodes,  is  called  the  focal  point  of  the  DFT.  The  focal  point  is 
analogous  to  the  stack  pointer  in  the  centralized  Depth  First  Search  algorithm 
[Tar72]. 

We  can  view  the  DFT  as  a  token  traversal  scheme  in  which  the  token  location 
is  the  focal  point  of  the  DFT.  In  the  algorithm  each  cluster  will  employ  a  DFT  to 
choose  one  outgoing  link  as  the  cluster’s  selected  outgoing  link. 

5.2.2.2. Selecting a cluster outgoing link

The cluster outgoing link selection phase begins after the synchronization of
the cluster has terminated. The head node of the new cluster initiates a DFT token
on the cluster’s infrastructure. The token carries the id of the cluster which created it
and the maximum cluster-id observed so far (maximum-id). The cluster-id is used to
distinguish between inter- and intra-cluster links, and the maximum-id is used to
detect a cycle of clusters. The DFT token searches the cluster for a new link, i.e., a
link never used by the algorithm. Doing so, the DFT token leaves behind a trail of
active links, which leads from the cluster-head to the token location (which is the
traversal focal point).

Upon finding a new outgoing link, l, the token traverses link l to node v on
the other end of l. If v belongs to the same cluster, the token returns to l’s tail via
the in-tree and the active path. Link l is then marked killed, and it will never again
be traversed. If, on the other hand, the token arrives at a different cluster, the
information it carries and the newly selected link it has established enable the cycle
detection to continue as described below.

5.2.3. Cycle Detection

For the sake of simplicity, we have selected an inefficient algorithm for cycle
detection. Since its complexity is, in general, not the bottleneck, we do not
discuss an improved mechanism for cycle detection here. (The improvements are a
generalization of [Pet82, Dol82], with which our algorithm would perform optimally
on rings.)

After  each  cluster  selects  an  outgoing  link,  the  network  contains  at  least  one 
cycle  which  consists  of  two  or  more  clusters  (see  figure  5.1).  Let  each  cluster  send 
its  id  on  the  cluster  outgoing  link.  A  cluster  forwards  another  cluster-id  only  if  it  is 
larger  than  all  the  cluster-ids  it  has  received  in  the  past.  Eventually,  one  and  only 
one  cluster  in  each  cycle  will  receive  the  same  cluster-id  twice,  thus  detecting  a 
cycle.  The  cluster-head  that  detected  the  cycle  synchronizes  the  new  cluster. 
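
The forwarding rule can be illustrated on a single cycle of clusters. The sketch below is a simplified simulation under assumptions that are not spelled out in the text: the clusters form a pure ring (no clusters feeding into the cycle from outside), cluster-ids are distinct, each cluster’s maximum-id starts at its own id, and messages are processed in FIFO order. The function name and the example ids are hypothetical.

```python
from collections import deque


def detect_cycle_head(ring_ids):
    """ring_ids[i] is the id of the cluster whose outgoing link leads to
    the cluster at position (i + 1) % len(ring_ids).

    Returns the ids of the clusters that detect the cycle.
    """
    n = len(ring_ids)
    max_seen = list(ring_ids)        # assumption: maximum-id starts at own id
    # Every cluster first sends its own id on its cluster outgoing link.
    queue = deque(((i + 1) % n, ring_ids[i]) for i in range(n))
    detectors = []
    while queue:
        dest, cluster_id = queue.popleft()
        if cluster_id == ring_ids[dest]:
            # The id came all the way around: this cluster detects the cycle.
            detectors.append(cluster_id)
        elif cluster_id > max_seen[dest]:
            # Forward only ids larger than everything seen so far.
            max_seen[dest] = cluster_id
            queue.append(((dest + 1) % n, cluster_id))
        # Smaller ids are silently dropped.
    return detectors


winners = detect_cycle_head([4, 9, 2, 7])
```

Under these assumptions only the largest id survives a full trip around the ring, so exactly one cluster detects the cycle.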

The operation of cycle detection is carried out by the cluster-heads. To
implement it, each node receiving a message from a different cluster forwards it to
its cluster-head through the cluster’s in-tree. To forward a maximum-id from a
cluster-head to the next cluster, the cluster-head broadcasts the maximum-id over the
infrastructure of its cluster. All nodes retain a maximum-id variable, which is
updated by the maximum-id of the broadcast message. The DFT token updates its
maximum-id to the largest it encounters along its way. If a cluster outgoing link has
been selected, the broadcast message is simply forwarded over the outgoing link to
the next cluster.

When  a  cluster-head  receives  a  maximum-id  which  is  equal  to  its  own,  it  has 
detected  a  cycle  and  it  is  elected  to  start  the  synchronization  phase. 

5.2.4.  Cycle  Contraction  and  Cluster  Synchronization 

In the previous phase a new cluster with an elected cluster-head was found.
In this phase the elected cluster-head satisfies the remaining three inductive
assumptions on the new cluster. The elected cluster-head thus (1) notifies all the nodes in
the clusters around the cycle of their new cluster-id; (2) combines the in-trees of
all the clusters into one in-tree which spans the new cluster and is rooted at the
elected cluster-head; and (3) combines the infrastructures into one infrastructure
spanning the new cluster. After receiving a positive acknowledgment that all the
nodes have completed these constructions, the elected cluster-head starts the next
phase of cluster outgoing link selection. The synchronization phase is carried out by
a broadcast with echo mechanism on the new cluster’s infrastructure.

Upon receiving the first copy of the broadcast message, every node performs
the following five operations in the following order: (1) it updates its own cluster-id;
(2) it marks its cluster-outgoing link (if it has one) as elementary; (3) it updates (if
necessary) its intree mark; (4) it forwards copies of the broadcast message over its
elementary outgoing links; and (5) it acknowledges the reception of the message
through the in-tree. Any duplicate copies of the message are discarded.


The second operation above combines the infrastructures into one
infrastructure which spans the new cluster.

5.2.4.1. Merging the in-trees

To  merge  all  the  in-trees  around  the  cycle,  we  use  the  active  paths  in  each 
cluster  which  lead  from  the  cluster-heads  to  the  head  nodes  of  the  selected  outgoing 
links. The active paths are constructed during the DFT in the cluster outgoing link
selection phase.

We  notice  that,  if  each  node  of  a  cluster  that  has  an  active  outgoing  link  puts 
its  intree  mark  on  the  active  link,  then  the  incoming  spanning  tree  is  rerooted  to  the 
head node of the cluster outgoing link (see figure 5.2). (Note that the head node of
the  cluster  outgoing  link  belongs  to  the  next  cluster  on  the  cycle  of  clusters.)  To 
obtain  a  directed  in-tree  which  spans  all  the  clusters  around  the  cycle,  we  perform 
this  operation  in  all  the  clusters  around  the  cycle  except  for  the  elected  cluster. 
Thus,  the  in-tree  formed  is  rooted  at  the  elected  cluster-head.  This  in-tree  is  used  by 
the  nodes  to  notify  their  new  cluster-head  of  the  contraction  termination. 
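
The rerooting step can be sketched as a pointer update. This is an illustration on one machine rather than the message-level procedure: the intree mark of each node is modeled as an entry in a dictionary, and the function names and the five-node example are assumptions.

```python
def reroot_in_tree(parent, active_path, exit_node):
    """Put the intree mark of every node on the active path onto its active link.

    parent:      node -> next node toward the old root (the cluster-head)
    active_path: nodes from the cluster-head to the tail of the cluster
                 outgoing link, in order
    exit_node:   head node of the cluster outgoing link (in the next cluster)
    """
    new_parent = dict(parent)
    successors = active_path[1:] + [exit_node]
    for node, successor in zip(active_path, successors):
        new_parent[node] = successor
    return new_parent


def reaches(parent, start, target, limit):
    """Follow intree marks from start for at most `limit` links."""
    node = start
    for _ in range(limit):
        if node == target:
            return True
        node = parent[node]
    return node == target


# Hypothetical cluster: in-tree c -> b -> a -> h and d -> h, rooted at head h.
old_tree = {"h": None, "a": "h", "b": "a", "c": "b", "d": "h"}
# Active path h -> c -> d; the cluster outgoing link leads from d to v.
new_tree = reroot_in_tree(old_tree, ["h", "c", "d"], "v")
```

After the update every node of the cluster reaches v by following intree marks: a node first follows its old path toward h until it meets the active path, and then follows the active path out of the cluster.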

5.2.4.2.  Acknowledging  the  broadcast 

To  notify  the  cluster-head  that  all  the  nodes  in  the  new  cluster  are  aware  of 
their  new  cluster-id,  every  node  sends  an  acknowledgment  as  follows:  After 
receiving  the  first  copy  and  making  all  the  necessary  updates,  every  node  sends  an 
acknowledgment  over  all  the  elementary  outgoing  links  aside  from  the  intree  marked 
link.  An  acknowledgment  is  sent  over  the  intree  marked  outgoing  link  only  after  an 
acknowledgment  has  been  received  on  all  the  elementary  incoming  links. 

Figure 5.2: Rerooting an in-tree

When  the  elected  cluster-head  has  received  the  acknowledgment  message 
over  all  of  its  elementary  incoming  links,  the  contraction  has  terminated,  and  a  new 
phase  of  cluster  outgoing  link  selection  is  started. 

5.2.5. Termination


The  algorithm  terminates  when  a  cluster  fails  to  select  a  cluster-outgoing 
link.  This  cluster  spans  the  whole  network,  its  cluster-head  is  the  elected  node  and 
its  maximum-id  is  the  largest  id. 

5.3. Complexity of the Election Algorithm

Communication  complexity  analysis  involves  counting  the  total  number  of 
bits  transmitted  over  all  the  network  links.  The  communication  cost  of  the  algorithm 
has  three  components:  the  cost  of  cluster  synchronizations,  the  cost  of  cycle  detec¬ 
tions  and  the  cost  of  cluster-outgoing  link  selections. 

We observe the following two facts.

Fact 1: The number of cycle detections is, at most, n-1.

Fact 2: In a cluster of size k, there are, at most, 2k-1 elementary links.

Fact  1  holds  because  the  contraction  of  a  cycle  strictly  reduces  the  network 
size.  Fact  2  follows  from  fact  1  and  the  observation  that  there  exists  a  one-to-one 
correspondence  between  clusters  and  elementary  links. 

5.3.1.  Cluster  Synchronization  Cost 

Lemma 5.1: The total cost of synchronizing the clusters is at most O(n^2 log n)
bits.

proof: According to fact 1 there are, at most, n-1 cluster synchronizations. The
synchronization messages propagate on the infrastructure of a cluster which
contains, at most, 2n-1 links. In each synchronization one broadcast message and one
echo message are transmitted on each elementary link. Since each message is of size
O(log n) bits the result follows. ■
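
The count behind Lemma 5.1 can be written out explicitly. The following is a sketch that uses only Facts 1 and 2; the constant c, standing for a message length of at most c log n bits, is introduced here for illustration.

```latex
\underbrace{(n-1)}_{\text{synchronizations (Fact 1)}}
\times
\underbrace{2\,(2n-1)}_{\text{broadcast and echo on each elementary link (Fact 2)}}
\times
\underbrace{c \log n}_{\text{bits per message}}
\;\le\; 4c\,n^{2}\log n \;=\; O(n^{2}\log n).
```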


5.3.2.  Cycle  Detection  Cost 

Lemma 5.2: The total cost of all cycle detections is at most O(n^2 log n) bits.

proof: Each time a new cluster head is elected a new maximum-id is sent around a
cycle of clusters. Thus, by fact 2, there are at most 2n-2 maximum-id initiations.
Each such initiation is sent over the infrastructure of some cluster. The same
maximum-id is forwarded at most three times on the same link in a particular
cluster: once when it enters the cluster and the link is an in-tree link on which it was
forwarded to the cluster-head; once when the cluster-head broadcasts the maximum-id;
and once on its return to that cluster-head (which then detects a cycle). Since a link
forwards the same maximum-id at most three times for each cluster it belongs to, the
result follows. ■

5.3.3. The Cost of Cluster Outgoing Link Selection

The cost involved in selecting outgoing links consists of the cost of killing
intra cluster links and the cost of the DFT’s.

To kill link l, the algorithm transmits one message of O(log n) bits over l
and a kill message of size O(1) bits over a path from the head of l to its tail. The kill
message goes down the in-tree to the cluster-head and then along the active path to
the tail of l. This node is distinguished from other nodes on the active path since it
is the focal point of the DFT. Thus, the killing of one link costs O(n) bits. Since the
algorithm kills, at most, |E| links, the killing of intra cluster links adds up to, at
most, O(n · |E|) bits.

As mentioned, the cost of one DFT on a network with n nodes and |E| links
is O(n · |E| · log n) since there is one backtracking for each link of the network, and
each backtracking costs O(n log n). The DFT operation is employed in the election
algorithm to search the infrastructures of different clusters. Since there are at most
twice as many links as nodes in an infrastructure, the cost incurred by each DFT of a
cluster with k nodes is O(k^2 log n). As the DFT could be used n-1 times by the
algorithm, the total cost of all DFT’s might be O(n^3 log n).

To reduce the total cost of all DFT’s from O(n^3 log n) to O(n^2 log n) bits, a
special cluster head election phase is added after the cycle detection and before the
synchronization phases. In this phase the cluster head which is elected in the cycle
detection phase synchronizes the election of the cluster-head of the largest size
cluster around the cycle. The new cluster-head then proceeds with the cluster
synchronization phase as described before. After synchronizing the cluster, the
cluster-head resumes its DFT process from the previous stage, i.e., from the head node of its
former cluster outgoing link, thus avoiding re-searching nodes that were already
searched. The head node of its former cluster outgoing link is now the last node on
the active path.

5.3.3.1. Cluster-Head Election

To elect a cluster-head of a largest size cluster (ties are resolved by
cluster-ids), the cluster-head detecting the cycle sends an elect-message around the
alternating sequence of active paths and in-trees which form a directed cycle (see figure
5.1). On its way, the elect-message finds out which cluster has the greatest number
of nodes and what the total number of nodes in all the clusters around the cycle is.
This information is updated on the elect-message by the cluster-heads along the
directed cycle.

Once  the  elect  message  returns  to  the  originating  cluster-head,  the  control  of 
cluster  synchronization  is  passed  to  the  newly  elected  cluster-head.  Any  new  and 
larger  maximum-id  which  arrives  at  the  cluster-head  detecting  the  cycle  during  the 
election  and  the  synchronization  phase  is  held  back  by  this  node.  It  is  forwarded  to 
the  new  cluster-head  upon  the  reception  of  the  synchronization  broadcast. 

After being elected, the new cluster-head resumes its DFT of the previous
cluster outgoing link selection phase. Doing so it uses the active path and the node
marks which its DFT had left. The algorithm thus can be viewed as a process in
which all the cluster-heads are candidates for leadership. When clusters owned by
different candidates form a cycle, the candidate which owns the largest size cluster
eliminates all the other candidates. In doing so the candidate merely has enlarged its
cluster to include the clusters of the candidates it has eliminated. The DFT structures
it had previously are then extended to search the enlarged cluster. Henceforth, we
refer to the clusters which did not change their cluster-id as one cluster which has
consumed other clusters during the algorithm.

The above scheme is similar in principle to the capturing rule of algorithm A
in chapter 2 (Section 2.5). There we enabled the largest candidate, in terms of
captured nodes, to kill and take nodes from a smaller candidate. To see that the above
scheme does not add more than a constant factor to the complexity of a single DFT,
we will use a lemma similar to Lemma 2.2, which was introduced in a similar
context by Gallager [Gal77]:

Lemma 5.3: For any given k, the number of clusters that own n/k nodes or more
is, at most, k.

Proof: Let C1 and C2 be any two clusters which had size n/k at some point of time.
We shall show that each of C1 and C2 must have had n/k nodes disjointly. If they
have never consumed each other, we are done, since the clusters were certainly
disjoint. If, w.l.o.g., C1 has consumed C2, then C1 must have already had n/k nodes at
the time of eliminating C2. ■

Corollary: The largest cluster to be consumed by another cluster owns at most n/2
nodes, the next largest at most n/3, etc.
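
The corollary can be checked experimentally. The sketch below is a toy model, not the election algorithm: it starts from n single-node clusters, repeatedly picks a random group of clusters to stand for a cycle, and lets the largest cluster of the group consume the others. The function name, the random grouping and the seeds are assumptions made for illustration.

```python
import random


def consumed_sizes(n, seed=0):
    """Return the sizes of all consumed clusters, largest first."""
    rng = random.Random(seed)
    sizes = [1] * n                       # every node starts as its own cluster
    consumed = []
    while len(sizes) > 1:
        # Pick a random group of at least two clusters to form a "cycle".
        group_size = rng.randint(2, len(sizes))
        rng.shuffle(sizes)
        group, rest = sizes[:group_size], sizes[group_size:]
        winner = max(group)               # the largest cluster survives
        group.remove(winner)              # remaining entries are consumed
        consumed.extend(group)
        sizes = rest + [winner + sum(group)]
    return sorted(consumed, reverse=True)


sizes = consumed_sizes(50, seed=1)
# Corollary: the j-th largest consumed cluster has at most n/(j+1) nodes.
within_bound = all(s * (j + 2) <= 50 for j, s in enumerate(sizes))
```

The bound holds in this model because a cluster can only consume clusters no larger than itself, so every cluster that ever reached a given size did so on nodes disjoint from those of the other clusters that reached it.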

Thus,  we  arrive  at  the  following: 

Lemma 5.4: The total cost of the DFT’s is O(n^2 log n).

proof: The cost of traversing a cluster of size k is at most k^2 log n bits. Hence, the ...

Adding the bit costs of all three components, we arrive at a total
communication complexity of O(n^2 log n + n · |E|) bits for the whole algorithm. By the
arguments presented in section 4.5 we conjecture that this is also the lower bound on the
communication complexity of the election problem on arbitrary topology strongly
connected unidirectional networks.
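
The bound of Lemma 5.4 and the total above can be sketched as follows. This is one way to obtain them from the corollary to Lemma 5.3, and is not necessarily the original argument. It assumes, up to constant factors, that the DFT of the surviving cluster, which is resumed rather than restarted, costs at most n^2 log n bits in total, and that the DFT of a consumed cluster of size k costs at most k^2 log n bits.

```latex
\begin{align*}
\text{cost of all DFT's}
  &\le \underbrace{n^{2}\log n}_{\text{surviving cluster}}
     + \sum_{j \ge 2} \left(\frac{n}{j}\right)^{2} \log n
   = n^{2}\log n \sum_{j \ge 1} \frac{1}{j^{2}}
   \le \frac{\pi^{2}}{6}\, n^{2}\log n
   = O(n^{2}\log n), \\
\text{total cost}
  &= \underbrace{O(n^{2}\log n)}_{\text{synchronization}}
   + \underbrace{O(n^{2}\log n)}_{\text{cycle detection}}
   + \underbrace{O(n\,|E|) + O(n^{2}\log n)}_{\text{outgoing link selection}}
   = O(n^{2}\log n + n\,|E|).
\end{align*}
```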


5.4.  The  Traversal  Algorithm  as  a  Special  Case  of  the  Election  Algorithm 

In  this  section  we  show  how  Traversal-2  and  -3  can  be  derived  as  a  special 
case  of  the  election  algorithm.  Imagine  the  behavior  of  the  election  algorithm  when 
it  is  started  by  a  single  node.  In  this  case,  the  initiator  initiates  a  process  which  visits 
all  the  nodes  in  the  network  and  terminates.  In  the  first  subsection  we  show  that  the 
process can be modified to behave exactly as the process in Traversal-2. In
subsection 5.4.2 we show how Traversal-3 can be derived from the election algorithm.

The  process  of  deriving  the  traversal  algorithms  from  the  election  algorithm 
provides  a  constructive  proof  of  lemma  4.6. 

5.4.1.  Deriving  Traversal-2  from  the  election  algorithm 

Assuming  that  only  one  node  initiates  the  election  algorithm,  we  make  the 
following  observations: 

First, at any given time, at most one cluster is searched, i.e., no messages are
exchanged  in  any  of  the  other  clusters. 

Second,  all  the  clusters  and  their  selected  outgoing  links  form  a  simple 
directed  path  in  which  each  cluster  is  a  node  and  each  link  is  a  selected  outgoing 
link.  This  path,  called  the  clusters  active  path,  occasionally  closes  on  itself  (see 
figure  5.3).  When  the  path  closes  either  a  cycle  of  clusters  is  formed,  or  the  last  link 
on  the  path  is  an  intra  cluster  link  (in  the  last  cluster  on  the  path,  see  figure  5.3,  cases 
2  and  1,  respectively). 

Third, consider the clusters active path when it does not close on itself.
Then, the nodes at which the clusters active path enters clusters are the first nodes to
be explored in each cluster. Therefore, if these nodes were selected as the
cluster-heads of their clusters, the active paths of all the clusters would form a single
contiguous path at all times.

We now modify the election algorithm according to the above observations;
namely, in each cluster the first node to be explored is elected as the cluster-head.
Cycle detection (case 2 in figure 5.3) occurs when the token leaves one cluster C1
(by traversing its selected outgoing link for the first time) and arrives at another,
previously explored, cluster C2. Recalling the third observation, the cluster-head of C2
is the "oldest" node on the newly formed cycle of clusters. This cluster-head (h)
synchronizes the cycle of clusters and is elected to be the cluster-head of the new
cluster. Also (from the third observation), the active paths around the cycle form a
single contiguous path. If the synchronization phase is modified to leave all the
active paths intact, the distinct DFT’s may be considered as one DFT. Thus the
newly elected cluster-head, h, resumes a DFT which already has an active path going
through all the clusters around the cycle. As a result, no node in the network is
searched more than once during the whole algorithm. The cluster-head resumes the
DFT at the node in which the cycle was detected (the LOOP node in figure 5.3, case
2).


The  above  variation  of  the  election  algorithm  can  be  viewed  as  a  traversal 
process.  One  node  spawns  the  process  which  terminates  at  the  initiator  after  visiting 
all  the  nodes  in  the  network,  one  at  a  time.  This  traversal  process  can  be  further 
modified  to  work  on  a  unidirectional  network  of  finite  automata,  as  we  show  in  the 
next  section. 


Figure 5.3: Clusters in the Traversal Algorithm (panels: Case 1 and Case 2; legend: focal point, node, LOOP node, cluster-head, intree link, active link, cluster outgoing link)

5.4.2. Deriving Traversal-3 from the election algorithm

Two operations in the preceding traversal algorithm require O(log n) bits
of memory in each node. The first is distinguishing an intra cluster link from an
intercluster link. The second (which occurs during the DFT of the infrastructure) is
backtracking from the last node on the active path.

In the first operation, the O(log n) bits are used to distinguish between the
case that a newly traversed link, l, is an intra cluster link and the case that l closed a
cycle of clusters. This operation was accomplished in the preceding algorithm by
comparing the id carried by the token with the cluster-id of the head node of l. To
perform the operation without ids, we note that, in both cases, l closed a directed
cycle of nodes and links (composed of an active path followed by a path in an
in-tree, see figure 5.3). In the first case, exactly one cluster-head resides on the cycle
while, in the second case, at least two cluster-heads reside on the cycle (see figure
5.3).

We now explain how the token decides whether there is more than one
cluster-head on the cycle. The last node on the active path in each cluster is marked
focal point (it is the focal point of that cluster’s DFT, see section 3.2). The head
node of a newly traversed link l is marked LOOP if it is an already explored node.
Upon arriving at a LOOP node, the token is sent around the cycle to find out whether
there is more than one cluster-head on it. Since there is exactly one LOOP node on
such a cycle, the walk around it utilizes a finite number of bits on the token. If
exactly one cluster-head was found, the LOOP mark is removed. The token then
walks around to the focal point, kills l and resumes the DFT. If, on the other hand,
more than one cluster-head was found, a cycle of clusters was identified. The token
then makes another trip around the cycle in order to synchronize its clusters. On the
second round, the token first erases all the focal point marks. Second, it erases all the
cluster-head marks, aside from the first. Third, the token modifies the in-trees of all
clusters, aside from the first one, to be rerooted at the LOOP node (in the same way
as was done in the election algorithm). Notice that the active path of the first cluster
lies between the first and the second cluster-head. Arriving at the LOOP node for
the second time, the synchronization phase terminates. The LOOP mark is removed,
the focal point mark is put on, and the DFT is resumed.
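
The first walk around the cycle needs only a bounded counter on the token. The sketch below shows that decision in isolation, assuming the nodes met on one trip from the LOOP node back to itself are given as a list of cluster-head flags. The function name and the return labels are hypothetical.

```python
def classify_new_link(cluster_head_flags):
    """Decide what a newly traversed link closed, using a counter that
    saturates at 2 (so the token carries a constant number of bits).

    cluster_head_flags: for each node met on one trip around the cycle,
    True if the node carries a cluster-head mark.
    """
    heads_seen = 0
    for is_head in cluster_head_flags:
        if is_head and heads_seen < 2:
            heads_seen += 1               # 0 -> 1 -> "more than one"
    if heads_seen == 1:
        return "intra cluster link"       # kill the link, resume the DFT
    if heads_seen == 2:
        return "cycle of clusters"        # make a second, synchronizing trip
    raise ValueError("a cycle must contain at least one cluster-head")


case_1 = classify_new_link([False, True, False])
case_2 = classify_new_link([True, False, True, False, True])
```

The counter never needs to distinguish two cluster-heads from more than two, which is why no node ids or O(log n) bit counts are required for this step.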

The  backtracking  in  the  DFT  is  performed  without  using  node  ids  by  using 
the  same  solution  that  was  suggested  in  section  4.4.2. 

The communication cost of the cycle detection and synchronization phases
does not change in the modified algorithm. The cost of the cluster outgoing link
selection phases is O(n^2 log n) bits since the DFT searching for outgoing links
requires one backtracking for each link of the infrastructure. Thus, the
communication complexity of the traversal algorithm is the same as that of the election
algorithm.

5.5.  Concluding  Remarks 

The  election  algorithm  can  be  modified  to  produce  an  out-tree  in  the  same 
way  that  Traversal-2  and  -3  were  modified  in  the  previous  chapter.  The  combined 
structure of the in-tree and out-tree can be used in many different ways as was
suggested in the previous chapter.

As  shown  in  [Gaf84],  any  algorithm  which  requires  common  knowledge  is 
equivalent  to  an  election  algorithm.  Therefore,  we  expect  the  election  and  traversal 
algorithms  to  serve  as  building  blocks  in  many  unidirectional  network  algorithms. 
An example of such an application, which involves termination detection, can be
found in [Mis83].


An interesting observation is that the amount of communication in the
unidirectional election algorithm, O(n · |E| + n^2 log n) bits, is n times the number of
messages in the optimal bidirectional algorithm. We would obtain the same cost if
we were to simulate the bidirectional algorithm [Gal83], with each acknowledgment
charged as n bits, on a unidirectional network. Together with lemma 4.9, this leads
to the conjecture that our election algorithm is as efficient as possible in terms of
communication cost.